[Python-ideas] Processing surrogates in

Tue May 5 21:21:37 CEST 2015

On 5 May 2015 at 18:23, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> So this proposal merely amounts to reintroduction of the Python 2 str
> confusion into Python 3.  It is dangerous *precisely because* the
> current situation is so frustrating.  These functions will not be used
> by "consenting adults", in most cases.  Those with sufficient
> knowledge for "informed consent" also know enough to decode encoded
> text ASAP, and encode internal text ALAP, with appropriate handlers,
> in the first place.
>
> Rather, these str2str functions will be used by programmers at the
> ends of their ropes desperate to suppress "those damned Unicode
> errors" by any means available.  In fact, they are most likely to be
> used and recommended by *library* writers, because they're the ones
> who are least likely to have control over input, or to know their
> clients' requirements for output.  "Just use rehandle_* to ameliorate
> the errors" is going to be far too tempting for them to resist.

The primary intended audience is Linux distribution developers using
Python 3 as the system Python. I agree misuse in other contexts is a
risk, but I consider assisting the migration of the Linux ecosystem
from Python 2 to Python 3 sufficiently important that it's worth our
while taking that risk.

> That Nick, of all people, supports this proposal is to me just
> confirmation that it's frustration, and only frustration, speaking
> here.  He used to be one of the strongest supporters of keeping
> "native text" (Unicode) and "encoded text" separate by keeping the
> latter in bytes.

It's not frustration (at least, I don't think it is), it's a proposal
for advanced tooling to deal properly with legacy *nix systems that
either:

a. use a locale encoding other than UTF-8; or
b. don't reliably set the locale encoding for system services and cron
jobs (which anecdotally appears to amount to "aren't using systemd" in
the current crop of *nix init systems)

If a developer only cares about Windows, Mac OS X, or modern
systemd-based *nix systems that use UTF-8 as the system locale, and they never
set "LANG=C" before running a Python program, then these new functions
will be completely irrelevant to them. (I've also submitted a request
to the glibc team to make C.UTF-8 universally available, reducing the
need to use "LANG=C", and they're amenable to the idea, but it
requires someone to work on preparing and submitting a patch:
https://sourceware.org/bugzilla/show_bug.cgi?id=17318)

If, however, a developer wants to handle "LANG=C" or other non-UTF-8
locales reliably across the full spectrum of *nix systems in Python 3,
they need a way to cope with system data that they *know* has been
decoded incorrectly by the interpreter. We'll potentially do exactly
that for environment variables, command line arguments,
stdin/stdout/stderr and more if we get bad locale encoding settings
from the OS (such as when "LANG=C" is specified, or the init system
simply doesn't set a locale at all and hence CPython falls back to the
POSIX default of ASCII).
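
As a rough illustration of that failure mode, the following sketch
simulates the ASCII fallback by hand rather than actually running
under "LANG=C". The file name is made up for the example.

```python
# Bytes as the OS would hand them over: a UTF-8 encoded file name.
raw = "café.txt".encode("utf-8")

# Simulate decoding under a locale that claims ASCII, using the
# "surrogateescape" error handler: each undecodable byte 0xNN becomes
# the lone surrogate U+DCNN.
decoded = raw.decode("ascii", errors="surrogateescape")
print(ascii(decoded))  # 'caf\udcc3\udca9.txt'

# The result is a str, but it is not valid text: strict encoding
# rejects the lone surrogates.
try:
    decoded.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
print(strict_encode_failed)  # True
```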

Python 2 lets users sweep a lot of that under the rug, as the data at
least round trips within the system, but you get unexpected mojibake
in some cases (especially when taking local data and pushing it out
over the network).

Since these boundary decoding issues don't arise on properly
configured modern *nix systems, we've been able to move Python 3
towards a more pragmatic and distro-friendly approach to coping with
legacy *nix platforms and behaviours, primarily by starting to use
"surrogateescape" by default on a few more system interfaces. For
example, we now use it on the standard streams when the OS *claims*
that the locale encoding is ASCII. We assume that claim indicates a
configuration error, one we can at least work around for
roundtripping purposes so that "os.listdir()" works reliably at the
interactive prompt.
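
The roundtripping property relied on here can be checked directly.
This is a minimal sketch with a made-up byte string, assuming the
same encoding and error handler are used in both directions:

```python
# Arbitrary bytes that are not valid ASCII (here, UTF-8 for "café.txt").
raw = b"caf\xc3\xa9.txt"

# Decode with the wrong (ASCII) encoding, smuggling the high bytes
# through as lone surrogates...
as_text = raw.decode("ascii", errors="surrogateescape")

# ...and encode again with the same codec and handler to get the
# original bytes back unchanged.
round_tripped = as_text.encode("ascii", errors="surrogateescape")
print(round_tripped == raw)  # True
```

On POSIX systems, os.fsdecode() and os.fsencode() apply the same
handler using the filesystem encoding.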

This change in approach (heavily influenced by the parallel "Python 3
as the default system Python" efforts in Ubuntu and Fedora) *has*
moved us back towards an increased risk of introducing mojibake in
legacy environments, but the nature of that trade-off has changed
markedly from the situation back in 2009 (let alone 2006):

* most popular modern Linux systems use systemd with the UTF-8 locale,
which "just works" from a boundary encoding/decoding perspective (it's
closely akin to the situation we've had on Mac OS X from the dawn of
Python 3)
* even without systemd, most modern *nix systems at least default to
the UTF-8 locale, which works reliably for user processes in the
absence of an explicit setting like "LANG=C", even if service daemons
and cron jobs can be a bit sketchier in terms of the locale settings
they receive
* for legacy environments migrating from Python 2 without upgrading
the underlying OS, our emphasis has shifted to tolerating "bug
compatibility" at the Python level in order to ease migration, as the
most appropriate long term solution for those environments is now to
upgrade their OS such that it more reliably provides correct locale
encoding settings to the Python 3 interpreter (which wasn't a
generally available option back when Python 3 first launched)

Armin Ronacher (as ever) provides a good explanation of the system
interface problems that can arise in Python 3 with bad locale encoding
settings here: http://click.pocoo.org/4/python3/#python3-surrogates

In my view, the critical helper function for this purpose is actually
"handle_surrogateescape", as that's the one that lets us readily adapt
from the incorrectly specified ASCII locale encoding to any other
ASCII-compatible system encoding once we've bootstrapped into a full
Python environment which has more options for figuring out a suitable
encoding than just looking at the locale setting provided by the C
runtime. It's also the function that serves to provide the primary
"hook" where we can hang documentation of this platform specific
boundary encoding/decoding issue.
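
To make the idea concrete, here is a sketch of that kind of
re-decoding. The function name, parameters and defaults below are
hypothetical and are not the signature under discussion; the sketch
only shows the underlying encode-then-decode step.

```python
def redecode_escaped(text, assumed="ascii", actual="utf-8"):
    """Hypothetical helper: undo a wrong boundary decoding.

    `text` is assumed to have been decoded from bytes using `assumed`
    with the "surrogateescape" handler. Recover the original bytes,
    then decode them with the encoding we now believe is correct.
    """
    original_bytes = text.encode(assumed, errors="surrogateescape")
    # Keep "surrogateescape" here as well, so bytes that are invalid
    # in `actual` still survive as escapes instead of raising.
    return original_bytes.decode(actual, errors="surrogateescape")


# A file name that was decoded as ASCII although it was really UTF-8.
misdecoded = "caf\udcc3\udca9.txt"
print(redecode_escaped(misdecoded))  # café.txt
```

Note that this raises UnicodeEncodeError if the string already
contains properly decoded non-ASCII characters, since those cannot be
encoded back to ASCII.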

The other suggested functions are then more about providing a "peek
behind the curtain" API for folks that want to *use Python* to explore
some of the ins and outs of Unicode surrogate handling. Surrogates and
astrals really aren't that complicated, but we've historically hidden
them away as "dark magic not to be understood by mere mortals". In
reality, they're just different ways of composing sequences of
integers to represent text, and the suggested APIs are designed to
expose that in a way we haven't done in the past. I can't actually
think of a practical purpose for them other than teaching people the
basics of how Unicode representations work, but demystifying that
seems sufficiently worthwhile to me that I'm not opposed to their
inclusion (bear in mind I'm also the current "dis" module maintainer,
and a contributor to the "inspect" module, so I'm a big fan of exposing
underlying concepts like this in a way that lets people play with them
programmatically for learning purposes).
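
As an example of the "sequences of integers" view, the standard
UTF-16 arithmetic for splitting an astral code point into a surrogate
pair and joining it again fits in a few lines. The helper names are
made up for this sketch and are not the suggested APIs.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Join a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```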

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia