[Python-ideas] Processing surrogates in

Andrew Barnert abarnert at yahoo.com
Thu May 7 09:24:11 CEST 2015


On May 6, 2015, at 18:41, Rob Cliffe <rob.cliffe at btinternet.com> wrote:
> 
> This is no doubt not the best platform to raise these thoughts (which are nothing to do with Python - apologies), but I'm not sure where else to go.
> I watch discussions like this ...
> I watch posts like this one [Nick's] ...
> ...  And I despair.  I really despair.
> 
> I am a very experienced but old (some would say "dinosaur") programmer.
> I appreciate the need for Unicode.  I really do.
> I don't understand Unicode and all its complications AT ALL.
> And I can't help wondering:
>     Why, oh why, do things have to be SO FU*****G COMPLICATED?  This thread, for example, is way over my head.  And it is typical of many discussions I have stared at, uncomprehendingly.
> Surely 65536 (2-byte) encodings are enough to express all characters in all the languages in the world, plus all the special characters we need.

Ironically, that idea is exactly why there are problems even within the "all-Unicode" world where cp1252 and Big5 and Shift-JIS don't exist.

Apple, Microsoft, Sun, and a few other vendors jumped on the Unicode bandwagon early and committed themselves to the idea that 2 bytes is enough for everything. When the world discovered that wasn't true, we were stuck with a bunch of APIs that insisted on 2 bytes. Apple was able to partly make a break with that era, but Windows and Java are completely stuck with "Unicode means 16-bit" forever, which is why the whole world is stuck dealing with UTF-16 and surrogates forever.
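(The practical consequence is easy to demonstrate from Python, where the 16-bit assumption is observable through the codec: any code point above U+FFFF has no single 16-bit representation, so UTF-16 has to spend two code units on it, a surrogate pair.)

```python
# U+1F600 (GRINNING FACE) lies outside the Basic Multilingual Plane,
# so UTF-16 must encode it as a surrogate pair.
s = "\U0001F600"
utf16 = s.encode("utf-16-be")

# One code point, but two 16-bit code units in UTF-16.
assert len(s) == 1
assert len(utf16) == 4  # 2 code units * 2 bytes each

# The two code units are a high surrogate (0xD800-0xDBFF)
# followed by a low surrogate (0xDC00-0xDFFF).
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF
```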

> Why can't there be just ONE universal encoding?  

There is: UTF-8.

Except sometimes you have algorithms that require fixed width, so you need UTF-32.

And Java and Windows need UTF-16.

And a few Internet protocols need UTF-7.

And DNS needs a sort-of-UTF-5 called IDNA.
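Python ships codecs for all of these, so the trade-offs are easy to see side by side (byte counts below include the BOM that Python's "utf-16" and "utf-32" codecs emit by default):

```python
s = "héllo"

# Variable width: ASCII characters take 1 byte, é takes 2 in UTF-8.
assert len(s.encode("utf-8")) == 6

# UTF-16: 2 bytes per BMP code point, plus a 2-byte BOM.
assert len(s.encode("utf-16")) == 12

# UTF-32: fixed 4 bytes per code point, plus a 4-byte BOM.
assert len(s.encode("utf-32")) == 24

# UTF-7: the ASCII-safe encoding some mail protocols need.
assert s.encode("utf-7") == b"h+AOk-llo"

# IDNA: the ASCII-compatible encoding DNS requires for hostnames.
assert "bücher.example".encode("idna") == b"xn--bcher-kva.example"
```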

At least everything else can die out eventually, once every document stored in an old IBM code page or similar encoding has been transliterated or discarded. Unfortunately, there are still people creating cp1252 documents every day on brand-new Windows desktops (and there are still people creating filenames on Latin-1 filesystems on older Linux and Unix boxes, though that practice is dying out much faster), so who knows when that day will come. Python can't force it. Even the Unicode Consortium can't force it (especially since Microsoft is one of its most active members).

> (Decided upon, no doubt, by some international standards committee. There would surely be enough spare codes for any special characters etc. that might come up in the foreseeable future.)
> 
> Is it just historical accident (partly due to an awkward move from 1-byte ASCII to 2-byte Unicode, implemented in many different places, in many different ways) that we now have a patchwork of encodings that we strive to fit into some over-complicated scheme?

UTF-16 is a historical accident, and so are UTF-7 and IDNA. All of the non-Unicode encodings are historical accidents to an even greater degree.

> Or is there really some fundamental reason why things can't be simpler?  (Like, REALLY, REALLY simple?)

We really do need at least UTF-8 and UTF-32. But that's it. And I think that's simple enough.

> Imagine if we were starting to design the 21st century from scratch, throwing away all the history?  How would we go about it?

If we could start over with a clean slate today, I'm pretty sure we would have just one character set, Unicode, and two encodings, UTF-8 and UTF-32, and everyone would be happy (except for a small group in Japan who insist TRON's text model is better, but we can ignore them).

In particular, this would mean that in Python, a bytes object is either UTF-8, or it isn't text at all. No need to specify codecs or error handlers, no surrogates (and definitely no surrogate escapes), etc.

Plus, we'd have no daylight savings time, no changing timezone boundaries, seamless PyPI failovers, sensible drug laws, cars that run forever using garbage as fuel, no war, no crime, and Netflix would never remove a season when you're on episode 11 out of 13. (Unfortunately, we would still have perl. I don't know why, but I know we would.)

> (Maybe I'm just naive, but sometimes ... Out of the mouths of babes and sucklings.)
> Aaaargh!  Do I really have to learn all this mumbo-jumbo?!  (Forgive me. :-) )
> I would be grateful for any enlightenment - thanks in advance.
> Rob Cliffe
> 
> 
>> On 05/05/2015 20:21, Nick Coghlan wrote:
>>> On 5 May 2015 at 18:23, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>>> So this proposal merely amounts to reintroduction of the Python 2 str
>>> confusion into Python 3.  It is dangerous *precisely because* the
>>> current situation is so frustrating.  These functions will not be used
>>> by "consenting adults", in most cases.  Those with sufficient
>>> knowledge for "informed consent" also know enough to decode encoded
>>> text ASAP, and encode internal text ALAP, with appropriate handlers,
>>> in the first place.
>>> 
>>> Rather, these str2str functions will be used by programmers at the
>>> ends of their ropes desperate to suppress "those damned Unicode
>>> errors" by any means available.  In fact, they are most likely to be
>>> used and recommended by *library* writers, because they're the ones
>>> who are least likely to have control over input, or to know their
>>> clients' requirements for output.  "Just use rehandle_* to ameliorate
>>> the errors" is going to be far too tempting for them to resist.
>> The primary intended audience is Linux distribution developers using
>> Python 3 as the system Python. I agree misuse in other contexts is a
>> risk, but consider assisting the migration of the Linux ecosystem from
>> Python 2 to Python 3 sufficiently important that it's worth our while
>> taking that risk.
>> 
>>> That Nick, of all people, supports this proposal is to me just
>>> confirmation that it's frustration, and only frustration, speaking
>>> here.  He used to be one of the strongest supporters of keeping
>>> "native text" (Unicode) and "encoded text" separate by keeping the
>>> latter in bytes.
>> It's not frustration (at least, I don't think it is), it's a proposal
>> for advanced tooling to deal properly with legacy *nix systems that
>> either:
>> 
>> a. use a locale encoding other than UTF-8; or
>> b. don't reliably set the locale encoding for system services and cron
>> jobs (which anecdotally appears to amount to "aren't using systemd" in
>> the current crop of *nix init systems)
>> 
>> If a developer only cares about Windows, Mac OS X, or modern systemd
>> based *nix systems that use UTF-8 as the system locale, and they never
>> set "LANG=C" before running a Python program, then these new functions
>> will be completely irrelevant to them. (I've also submitted a request
>> to the glibc team to make C.UTF-8 universally available, reducing the
>> need to use "LANG=C", and they're amenable to the idea, but it
>> requires someone to work on preparing and submitting a patch:
>> https://sourceware.org/bugzilla/show_bug.cgi?id=17318)
>> 
>> If, however, a developer wants to handle "LANG=C", or other non-UTF-8
>> locales reliably across the full spectrum of *nix systems in Python 3,
>> they need a way to cope with system data that they *know* has been
>> decoded incorrectly by the interpreter, as we'll potentially do
>> exactly that for environment variables, command line arguments,
>> stdin/stdout/stderr and more if we get bad locale encoding settings
>> from the OS (such as when "LANG=C" is specified, or the init system
>> simply doesn't set a locale at all and hence CPython falls back to the
>> POSIX default of ASCII).
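That "decoded incorrectly" state is exactly what the surrogateescape error handler produces, and it is easy to reproduce:

```python
# A UTF-8 encoded filename, as the OS would hand it to Python as bytes.
raw = "café".encode("utf-8")          # b'caf\xc3\xa9'

# Under LANG=C, CPython decodes OS data as ASCII with surrogateescape:
# each undecodable byte becomes a lone surrogate in U+DC80-U+DCFF.
mangled = raw.decode("ascii", "surrogateescape")
assert mangled == "caf\udcc3\udca9"

# The escape is lossless: encoding back recovers the original bytes,
# which is what makes round-tripping (e.g. os.listdir()) work.
assert mangled.encode("ascii", "surrogateescape") == raw
```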
>> 
>> Python 2 lets users sweep a lot of that under the rug, as the data at
>> least round trips within the system, but you get unexpected mojibake
>> in some cases (especially when taking local data and pushing it out
>> over the network).
>> 
>> Since these boundary decoding issues don't arise on properly
>> configured modern *nix systems, we've been able to take advantage of
>> that by moving Python 3 towards a more pragmatic and distro-friendly
>> approach in coping with legacy *nix platforms and behaviours,
>> primarily by starting to use "surrogateescape" by default on a few
>> more system interfaces (e.g. on the standard streams when the OS
>> *claims* that the locale encoding is ASCII, which we now assume to
>> indicate a configuration error, which we can at least work around for
>> roundtripping purposes so that "os.listdir()" works reliably at the
>> interactive prompt).
>> 
>> This change in approach (heavily influenced by the parallel "Python 3
>> as the default system Python" efforts in Ubuntu and Fedora) *has*
>> moved us back towards an increased risk of introducing mojibake in
>> legacy environments, but the nature of that trade-off has changed
>> markedly from the situation back in 2009 (let alone 2006):
>> 
>> * most popular modern Linux systems use systemd with the UTF-8 locale,
>> which "just works" from a boundary encoding/decoding perspective (it's
>> closely akin to the situation we've had on Mac OS X from the dawn of
>> Python 3)
>> * even without systemd, most modern *nix systems at least default to
>> the UTF-8 locale, which works reliably for user processes in the
>> absence of an explicit setting like "LANG=C", even if service daemons
>> and cron jobs can be a bit sketchier in terms of the locale settings
>> they receive
>> * for legacy environments migrating from Python 2 without upgrading
>> the underlying OS, our emphasis has shifted to tolerating "bug
>> compatibility" at the Python level in order to ease migration, as the
>> most appropriate long term solution for those environments is now to
>> upgrade their OS such that it more reliably provides correct locale
>> encoding settings to the Python 3 interpreter (which wasn't a
>> generally available option back when Python 3 first launched)
>> 
>> Armin Ronacher (as ever) provides a good explanation of the system
>> interface problems that can arise in Python 3 with bad locale encoding
>> settings here: http://click.pocoo.org/4/python3/#python3-surrogates
>> 
>> In my view, the critical helper function for this purpose is actually
>> "handle_surrogateescape", as that's the one that lets us readily adapt
>> from the incorrectly specified ASCII locale encoding to any other
>> ASCII-compatible system encoding once we've bootstrapped into a full
>> Python environment which has more options for figuring out a suitable
>> encoding than just looking at the locale setting provided by the C
>> runtime. It's also the function that serves to provide the primary
>> "hook" where we can hang documentation of this platform specific
>> boundary encoding/decoding issue.
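The proposed handle_surrogateescape API isn't spelled out here, but the core "rehandling" trick it would build on is just an encode/decode round trip, pushing the smuggled bytes back out and decoding them with the encoding we now believe is correct. A minimal sketch (rehandle_as is a hypothetical name, not the proposed API, and it assumes the original mis-decode was ASCII with surrogateescape):

```python
def rehandle_as(mangled: str, encoding: str) -> str:
    """Re-decode a surrogateescape'd string with the right encoding."""
    # Recover the original bytes the lone surrogates were smuggling...
    raw = mangled.encode("ascii", "surrogateescape")
    # ...then decode them with the encoding we now trust.
    return raw.decode(encoding)

# A UTF-8 string that was wrongly decoded as ASCII under LANG=C:
mangled = "caf\udcc3\udca9"
assert rehandle_as(mangled, "utf-8") == "café"
```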
>> 
>> The other suggested functions are then more about providing a "peek
>> behind the curtain" API for folks that want to *use Python* to explore
>> some of the ins and outs of Unicode surrogate handling. Surrogates and
>> astrals really aren't that complicated, but we've historically hidden
>> them away as "dark magic not to be understood by mere mortals". In
>> reality, they're just different ways of composing sequences of
>> integers to represent text, and the suggested APIs are designed to
>> expose that in a way we haven't done in the past. I can't actually
>> think of a practical purpose for them other than teaching people the
>> basics of how Unicode representations work, but demystifying that
>> seems sufficiently worthwhile to me that I'm not opposed to their
>> inclusion (bear in mind I'm also the current "dis" module maintainer,
>> and a contributor to the "inspect" module, so I'm a big fan of exposing
>> underlying concepts like this in a way that lets people play with them
>> programmatically for learning purposes).
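For the curious, the "composing sequences of integers" that surrogate pairs perform really is simple arithmetic on the code point, as this sketch shows:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split an astral code point (above U+FFFF) into a UTF-16 pair."""
    assert cp > 0xFFFF
    offset = cp - 0x10000            # 20 bits to distribute
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

# Check the hand arithmetic against Python's own UTF-16 encoder.
cp = 0x1F600
high, low = to_surrogate_pair(cp)
encoded = chr(cp).encode("utf-16-be")
assert (high, low) == (int.from_bytes(encoded[:2], "big"),
                       int.from_bytes(encoded[2:], "big"))
```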
>> 
>> Cheers,
>> Nick.
>> 
> 
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/