[Python-ideas] Processing surrogates in

Wed May 6 06:00:29 CEST 2015

On May 5, 2015, at 12:21, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
>> On 5 May 2015 at 18:23, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>> So this proposal merely amounts to reintroduction of the Python 2 str
>> confusion into Python 3.  It is dangerous *precisely because* the
>> current situation is so frustrating.  These functions will not be used
>> by "consenting adults", in most cases.  Those with sufficient
>> knowledge for "informed consent" also know enough to decode encoded
>> text ASAP, and encode internal text ALAP, with appropriate handlers,
>> in the first place.
>> 
>> Rather, these str2str functions will be used by programmers at the
>> ends of their ropes desperate to suppress "those damned Unicode
>> errors" by any means available.  In fact, they are most likely to be
>> used and recommended by *library* writers, because they're the ones
>> who are least likely to have control over input, or to know their
>> clients' requirements for output.  "Just use rehandle_* to ameliorate
>> the errors" is going to be far too tempting for them to resist.
> 
> The primary intended audience is Linux distribution developers using
> Python 3 as the system Python. I agree misuse in other contexts is a
> risk, but consider assisting the migration of the Linux ecosystem from
> Python 2 to Python 3 sufficiently important that it's worth our while
> taking that risk.

In this case, the "unfortunate" fact that all these functions have to be "buried" in codecs rather than somewhere more discoverable sounds like a _good_ thing, not a problem. The Fedora and Ubuntu people will know where to find them, and other Linux distros will follow their lead. Meanwhile, the kind of end-user developers Stephen is worried about, who just throw in random encode and decode calls until their one test case on their one machine works, will never even notice them and will still be encouraged to actually do the right thing.

>> That Nick, of all people, supports this proposal is to me just
>> confirmation that it's frustration, and only frustration, speaking
>> here.  He used to be one of the strongest supporters of keeping
>> "native text" (Unicode) and "encoded text" separate by keeping the
>> latter in bytes.
> 
> It's not frustration (at least, I don't think it is), it's a proposal
> for advanced tooling to deal properly with legacy *nix systems that
> either:
> 
> a. use a locale encoding other than UTF-8; or
> b. don't reliably set the locale encoding for system services and cron
> jobs (which anecdotally appears to amount to "aren't using systemd" in
> the current crop of *nix init systems)

It seems like launchd systems are as good as systemd systems here. Or are you not considering OS X a *nix?

I suppose given that the timeline for Apple to switch to Python 3 as the default Python is "maybe it'll happen, but we'll never tell you until a month before the public beta", it isn't really all that relevant...

> If a developer only cares about Windows, Mac OS X, or modern systemd
> based *nix systems that use UTF-8 as the system locale, and they never
> set "LANG=C" before running a Python program, then these new functions
> will be completely irrelevant to them. (I've also submitted a request
> to the glibc team to make C.UTF-8 universally available, reducing the
> need to use "LANG=C", and they're amenable to the idea, but it
> requires someone to work on preparing and submitting a patch:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318)
> 
> If, however, a developer wants to handle "LANG=C", or other non-UTF-8
> locales reliably across the full spectrum of *nix systems in Python 3,
> they need a way to cope with system data that they *know* has been
> decoded incorrectly by the interpreter, as we'll potentially do
> exactly that for environment variables, command line arguments,
> stdin/stdout/stderr and more if we get bad locale encoding settings
> from the OS (such as when "LANG=C" is specified, or the init system
> simply doesn't set a locale at all and hence CPython falls back to the
> POSIX default of ASCII).
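
To make that failure mode concrete, here is a minimal sketch of what happens at the boundary when the interpreter believes the locale is ASCII but the OS hands it UTF-8 bytes. The byte string is made-up sample data standing in for an environment variable or argv entry; nothing beyond the stdlib "surrogateescape" error handler is assumed.

```python
# Sample data (an assumption for illustration): the UTF-8 bytes for "cafe"
# with an acute accent on the e, as a UTF-8 system might pass them in.
raw = b"caf\xc3\xa9"

# Roughly what the interpreter does when it believes the locale is ASCII:
# each undecodable byte 0x80-0xFF becomes a lone surrogate U+DC80-U+DCFF.
wrong = raw.decode("ascii", "surrogateescape")
print(ascii(wrong))  # 'caf\udcc3\udca9'

# The result is not valid text, so a strict encode refuses it...
try:
    wrong.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the original bytes are still recoverable, losslessly.
roundtrip = wrong.encode("ascii", "surrogateescape")
print(roundtrip == raw)  # True
```

So the data round-trips, but anything that treats the str as real text (strict encoding, sending it over the network) will trip over the escaped bytes.
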
> 
> Python 2 lets users sweep a lot of that under the rug, as the data at
> least round trips within the system, but you get unexpected mojibake
> in some cases (especially when taking local data and pushing it out
> over the network).
> 
> Since these boundary decoding issues don't arise on properly
> configured modern *nix systems, we've been able to take advantage of
> that by moving Python 3 towards a more pragmatic and distro-friendly
> approach in coping with legacy *nix platforms and behaviours,
> primarily by starting to use "surrogateescape" by default on a few
> more system interfaces (e.g. on the standard streams when the OS
> *claims* that the locale encoding is ASCII, which we now assume to
> indicate a configuration error, which we can at least work around for
> roundtripping purposes so that "os.listdir()" works reliably at the
> interactive prompt).
> 
> This change in approach (heavily influenced by the parallel "Python 3
> as the default system Python" efforts in Ubuntu and Fedora) *has*
> moved us back towards an increased risk of introducing mojibake in
> legacy environments, but the nature of that trade-off has changed
> markedly from the situation back in 2009 (let alone 2006):
> 
> * most popular modern Linux systems use systemd with the UTF-8 locale,
> which "just works" from a boundary encoding/decoding perspective (it's
> closely akin to the situation we've had on Mac OS X from the dawn of
> Python 3)
> * even without systemd, most modern *nix systems at least default to
> the UTF-8 locale, which works reliably for user processes in the
> absence of an explicit setting like "LANG=C", even if service daemons
> and cron jobs can be a bit sketchier in terms of the locale settings
> they receive
> * for legacy environments migrating from Python 2 without upgrading
> the underlying OS, our emphasis has shifted to tolerating "bug
> compatibility" at the Python level in order to ease migration, as the
> most appropriate long term solution for those environments is now to
> upgrade their OS such that it more reliably provides correct locale
> encoding settings to the Python 3 interpreter (which wasn't a
> generally available option back when Python 3 first launched)
> 
> Armin Ronacher (as ever) provides a good explanation of the system
> interface problems that can arise in Python 3 with bad locale encoding
> settings here: http://click.pocoo.org/4/python3/#python3-surrogates
> 
> In my view, the critical helper function for this purpose is actually
> "handle_surrogateescape", as that's the one that lets us readily adapt
> from the incorrectly specified ASCII locale encoding to any other
> ASCII-compatible system encoding once we've bootstrapped into a full
> Python environment which has more options for figuring out a suitable
> encoding than just looking at the locale setting provided by the C
> runtime. It's also the function that serves to provide the primary
> "hook" where we can hang documentation of this platform specific
> boundary encoding/decoding issue.
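
For what it's worth, the repair step being described can already be spelled with the existing codecs machinery. A rough sketch, where redecode_escaped is a hypothetical name of mine and not the signature of the proposed handle_surrogateescape:

```python
def redecode_escaped(text, assumed="ascii", actual="utf-8"):
    """Undo a wrong boundary decode and redo it with a better guess.

    Hypothetical helper for illustration only; not the proposed codecs API.
    """
    # Recover the original bytes from the surrogate-escaped string.
    raw = text.encode(assumed, "surrogateescape")
    # Keep surrogateescape on the way back in, so bytes that are also
    # invalid in `actual` stay escaped instead of raising.
    return raw.decode(actual, "surrogateescape")


# UTF-8 bytes that were decoded as ASCII at the boundary.
misdecoded = "caf\udcc3\udca9"
fixed = redecode_escaped(misdecoded)
print(ascii(fixed))  # 'caf\xe9'

# Latin-1 bytes decoded as ASCII: a wrong guess leaves the escape in place,
# the right guess repairs it.
still_escaped = redecode_escaped("caf\udce9", actual="utf-8")
repaired = redecode_escaped("caf\udce9", actual="latin-1")
```

Note that this only makes sense for strings that really did come from a decode with the assumed encoding; text that already contains non-ASCII characters will fail the first encode step.
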
> 
> The other suggested functions are then more about providing a "peek
> behind the curtain" API for folks that want to *use Python* to explore
> some of the ins and outs of Unicode surrogate handling. Surrogates and
> astrals really aren't that complicated, but we've historically hidden
> them away as "dark magic not to be understood by mere mortals".

I thought most Linux system Python 2.x builds were wide builds, and there definitely aren't any UTF-16 system interfaces like there are on Windows (which misleadingly calls them "Unicode", and we abet that by not making people .encode('utf-16') in some of the places where they'd have to .encode('utf-8') on Mac and Linux...).

So I'm surprised there's a problem here at all. The only issues a Linux user is likely to ever see should be with surrogate escapes, not real surrogates, right?
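
A quick sketch of the distinction I'm drawing, assuming Python 3.5+ (for bytes.hex()); has_surrogate_escapes is a made-up helper for illustration, not an existing or proposed API:

```python
def has_surrogate_escapes(text):
    """True if text contains lone surrogates in U+DC80..U+DCFF, the range
    the surrogateescape handler uses for undecodable bytes."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


# A "real" astral character: one code point outside the BMP.
astral = "\U0001F40D"
print(len(astral))  # 1 on Python 3.3+

# Only the UTF-16 encoding spells it as a high/low surrogate pair
# (D83D DC0D, little-endian here).
pair = astral.encode("utf-16-le")
print(pair.hex())  # 3dd80ddc

# A surrogate escape: an undecodable byte smuggled through as U+DCFF.
escaped = b"\xff".decode("ascii", "surrogateescape")

print(has_surrogate_escapes(astral))   # False
print(has_surrogate_escapes(escaped))  # True
```

In other words, without UTF-16 system interfaces the str itself never holds a surrogate pair for an astral character; as far as I can tell, the only surrogates that should show up are the lone ones left behind by surrogateescape.
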

> In reality, they're just different ways of composing sequences of
> integers to represent text, and the suggested APIs are designed to
> expose that in a way we haven't done in the past. I can't actually
> think of a practical purpose for them other than teaching people the
> basics of how Unicode representations work, but demystifying that
> seems sufficiently worthwhile to me that I'm not opposed to their
> inclusion (bear in mind I'm also the current "dis" module maintainer,
> and a contributor to the "inspect" module, so I'm a big fan of exposing
> underlying concepts like this in a way that lets people play with them
> programmatically for learning purposes).
> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/