[Python-ideas] Processing surrogates in

Rob Cliffe rob.cliffe at btinternet.com
Thu May 7 03:41:34 CEST 2015


This is no doubt *not* the best platform to raise these thoughts (which 
are nothing to do with Python - apologies), but I'm not sure where else 
to go.
I watch discussions like this ...
I watch posts like this one [Nick's] ...
...  And I despair.  I really despair.

I am a very experienced but old (some would say "dinosaur") programmer.
I appreciate the need for Unicode.  I really do.
I don't understand Unicode and all its complications AT ALL.
And I can't help wondering:
     Why, oh why, do things have to be SO FU*****G COMPLICATED?  This 
thread, for example, is way over my head.  And it is typical of many 
discussions I have stared at, uncomprehendingly.
Surely 65536 (2-byte) codes are enough to express all the characters in 
all the languages of the world, plus all the special characters we need.
Why can't there be just *ONE* universal encoding?  (Decided upon, no 
doubt, by some international standards committee. There would surely be 
enough spare codes for any special characters etc. that might come up in 
the foreseeable future.)
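[Editorial aside, not part of the original message: the 16-bit assumption is exactly where early Unicode started and where it broke down. Unicode now defines code points up to U+10FFFF, far more than two bytes can hold, which a quick interpreter session can confirm:]

```python
import sys

# Modern Python 3 supports the full Unicode range, U+0000..U+10FFFF --
# over a million code points, not 65536.
print(hex(sys.maxunicode))        # 0x10ffff

# Plenty of everyday characters live outside the 16-bit Basic
# Multilingual Plane; emoji are the obvious example.
cp = ord("\N{GRINNING FACE}")
print(hex(cp))                    # 0x1f600, which is > 0xffff
```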

*Is it just historical accident* (partly due to an awkward move from 
1-byte ASCII to 2-byte Unicode, implemented in many different places, in 
many different ways) *that we now have a patchwork of encodings that we 
strive to fit into some over-complicated scheme*?
Or is there *really* some *fundamental reason* why things *can't* be 
simpler?  (Like, REALLY, _*REALLY*_ simple?)
Imagine if we were designing the 21st century from scratch, throwing 
away all the history.  How would we go about it?
(Maybe I'm just naive, but sometimes ... Out of the mouths of babes and 
sucklings.)
Aaaargh!  Do I really have to learn all this mumbo-jumbo?!  (Forgive me. 
:-) )
I would be grateful for any enlightenment - thanks in advance.
Rob Cliffe


On 05/05/2015 20:21, Nick Coghlan wrote:
> On 5 May 2015 at 18:23, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>> So this proposal merely amounts to reintroduction of the Python 2 str
>> confusion into Python 3.  It is dangerous *precisely because* the
>> current situation is so frustrating.  These functions will not be used
>> by "consenting adults", in most cases.  Those with sufficient
>> knowledge for "informed consent" also know enough to decode encoded
>> text ASAP, and encode internal text ALAP, with appropriate handlers,
>> in the first place.
>>
>> Rather, these str2str functions will be used by programmers at the
>> ends of their ropes desperate to suppress "those damned Unicode
>> errors" by any means available.  In fact, they are most likely to be
>> used and recommended by *library* writers, because they're the ones
>> who are least likely to have control over input, or to know their
>> clients' requirements for output.  "Just use rehandle_* to ameliorate
>> the errors" is going to be far too tempting for them to resist.
> The primary intended audience is Linux distribution developers using
> Python 3 as the system Python. I agree misuse in other contexts is a
> risk, but consider assisting the migration of the Linux ecosystem from
> Python 2 to Python 3 sufficiently important that it's worth our while
> taking that risk.
>
>> That Nick, of all people, supports this proposal is to me just
>> confirmation that it's frustration, and only frustration, speaking
>> here.  He used to be one of the strongest supporters of keeping
>> "native text" (Unicode) and "encoded text" separate by keeping the
>> latter in bytes.
> It's not frustration (at least, I don't think it is), it's a proposal
> for advanced tooling to deal properly with legacy *nix systems that
> either:
>
> a. use a locale encoding other than UTF-8; or
> b. don't reliably set the locale encoding for system services and cron
> jobs (which anecdotally appears to amount to "aren't using systemd" in
> the current crop of *nix init systems)
>
> If a developer only cares about Windows, Mac OS X, or modern systemd
> based *nix systems that use UTF-8 as the system locale, and they never
> set "LANG=C" before running a Python program, then these new functions
> will be completely irrelevant to them. (I've also submitted a request
> to the glibc team to make C.UTF-8 universally available, reducing the
> need to use "LANG=C", and they're amenable to the idea, but it
> requires someone to work on preparing and submitting a patch:
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318)
>
> If, however, a developer wants to handle "LANG=C", or other non-UTF-8
> locales reliably across the full spectrum of *nix systems in Python 3,
> they need a way to cope with system data that they *know* has been
> decoded incorrectly by the interpreter, as we'll potentially do
> exactly that for environment variables, command line arguments,
> stdin/stdout/stderr and more if we get bad locale encoding settings
> from the OS (such as when "LANG=C" is specified, or the init system
> simply doesn't set a locale at all and hence CPython falls back to the
> POSIX default of ASCII).
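[Editorial aside: the "decoded incorrectly by the interpreter" mechanism Nick describes can be seen directly with the `surrogateescape` error handler. Undecodable bytes become lone surrogates in the U+DC80..U+DCFF range, and encoding back with the same handler recovers the original bytes exactly. A minimal illustration, with made-up sample data:]

```python
# Suppose the locale claims ASCII but the OS hands us Latin-1 data:
# "café" encoded as b'caf\xe9'.
raw = b"caf\xe9"

# Decoding with surrogateescape never fails: the undecodable byte
# 0xE9 is smuggled through as the lone surrogate U+DCE9.
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))                # 'caf\udce9'

# Encoding back with the same handler round-trips the bytes exactly,
# which is what keeps os.listdir() and friends usable.
assert text.encode("ascii", "surrogateescape") == raw
```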
>
> Python 2 lets users sweep a lot of that under the rug, as the data at
> least round trips within the system, but you get unexpected mojibake
> in some cases (especially when taking local data and pushing it out
> over the network).
>
> Since these boundary decoding issues don't arise on properly
> configured modern *nix systems, we've been able to take advantage of
> that by moving Python 3 towards a more pragmatic and distro-friendly
> approach in coping with legacy *nix platforms and behaviours,
> primarily by starting to use "surrogateescape" by default on a few
> more system interfaces (e.g. on the standard streams when the OS
> *claims* that the locale encoding is ASCII, which we now assume to
> indicate a configuration error, which we can at least work around for
> roundtripping purposes so that "os.listdir()" works reliably at the
> interactive prompt).
>
> This change in approach (heavily influenced by the parallel "Python 3
> as the default system Python" efforts in Ubuntu and Fedora) *has*
> moved us back towards an increased risk of introducing mojibake in
> legacy environments, but the nature of that trade-off has changed
> markedly from the situation back in 2009 (let alone 2006):
>
> * most popular modern Linux systems use systemd with the UTF-8 locale,
> which "just works" from a boundary encoding/decoding perspective (it's
> closely akin to the situation we've had on Mac OS X from the dawn of
> Python 3)
> * even without systemd, most modern *nix systems at least default to
> the UTF-8 locale, which works reliably for user processes in the
> absence of an explicit setting like "LANG=C", even if service daemons
> and cron jobs can be a bit sketchier in terms of the locale settings
> they receive
> * for legacy environments migrating from Python 2 without upgrading
> the underlying OS, our emphasis has shifted to tolerating "bug
> compatibility" at the Python level in order to ease migration, as the
> most appropriate long term solution for those environments is now to
> upgrade their OS such that it more reliably provides correct locale
> encoding settings to the Python 3 interpreter (which wasn't a
> generally available option back when Python 3 first launched)
>
> Armin Ronacher (as ever) provides a good explanation of the system
> interface problems that can arise in Python 3 with bad locale encoding
> settings here: http://click.pocoo.org/4/python3/#python3-surrogates
>
> In my view, the critical helper function for this purpose is actually
> "handle_surrogateescape", as that's the one that lets us readily adapt
> from the incorrectly specified ASCII locale encoding to any other
> ASCII-compatible system encoding once we've bootstrapped into a full
> Python environment which has more options for figuring out a suitable
> encoding than just looking at the locale setting provided by the C
> runtime. It's also the function that serves to provide the primary
> "hook" where we can hang documentation of this platform specific
> boundary encoding/decoding issue.
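[Editorial aside: `handle_surrogateescape` was part of the proposal under discussion and never shipped in the standard library, but the re-decoding step it describes can be sketched with the existing codec machinery. The `redecode` name below is illustrative, not a real API:]

```python
def redecode(mistaken_text: str, actual_encoding: str) -> str:
    """Sketch of the proposed helper's core idea: take text the
    interpreter decoded as ASCII with surrogateescape, recover the
    original bytes, and decode them with the encoding we have since
    figured out is correct.  (Hypothetical helper, not stdlib.)"""
    raw = mistaken_text.encode("ascii", "surrogateescape")
    return raw.decode(actual_encoding)

# b'caf\xe9' mis-decoded under a wrongly-reported ASCII locale...
mistaken = b"caf\xe9".decode("ascii", "surrogateescape")
# ...can be repaired once we learn the data was really Latin-1:
print(redecode(mistaken, "latin-1"))   # café
```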
>
> The other suggested functions are then more about providing a "peek
> behind the curtain" API for folks that want to *use Python* to explore
> some of the ins and outs of Unicode surrogate handling. Surrogates and
> astrals really aren't that complicated, but we've historically hidden
> them away as "dark magic not to be understood by mere mortals". In
> reality, they're just different ways of composing sequences of
> integers to represent text, and the suggested APIs are designed to
> expose that in a way we haven't done in the past. I can't actually
> think of a practical purpose for them other than teaching people the
> basics of how Unicode representations work, but demystifying that
> seems sufficiently worthwhile to me that I'm not opposed to their
> inclusion (bear in mind I'm also the current "dis" module maintainer,
> and a contributor to the "inspect" module, so I'm a big fan of exposing
> underlying concepts like this in a way that lets people play with them
> programmatically for learning purposes).
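[Editorial aside: the "different ways of composing sequences of integers" point is easy to demonstrate. An astral (non-BMP) character is a single code point in Python's str, but UTF-16 composes it from a high/low surrogate pair of 16-bit code units:]

```python
# U+1F600 GRINNING FACE is one code point, above the BMP.
ch = "\U0001F600"
assert len(ch) == 1 and ord(ch) == 0x1F600

# Pull out the two 16-bit code units from the UTF-16-BE encoding.
be = ch.encode("utf-16-be")
units = [int.from_bytes(be[i:i + 2], "big") for i in (0, 2)]
print([hex(u) for u in units])    # ['0xd83d', '0xde00']

# The standard decomposition: subtract 0x10000, split into two
# 10-bit halves, and offset into the high/low surrogate ranges.
v = ord(ch) - 0x10000
assert units == [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
```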
>
> Cheers,
> Nick.
>


