[Python-ideas] Processing surrogates in

Thu May 7 09:31:09 CEST 2015

On 7 May 2015 at 17:55, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 7 May 2015 at 15:27, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> On 7 May 2015 at 11:41, Rob Cliffe <rob.cliffe at btinternet.com> wrote:
>>> Or is there really some fundamental reason why things can't be simpler?
>>> (Like, REALLY, REALLY simple?)
>>
>> Yep, there are around 7 billion fundamental reasons currently alive,
>> and I have no idea how many that have gone before us: humans :)
>
> Heh, a message from Stephen off-list made me realise that an info dump
> of all the reasons the edge cases are hard probably wasn't a good way
> to answer your question :)
>
> What "we're" working towards (where "we" ~= the Unicode consortium +
> operating system designers + programming language designers) is a
> world where everything "just works", and computers talk to humans in
> each human's preferred language (or a collection of languages,
> depending on what the human is doing), and to each other in Unicode.
> There are then a whole host of technical and political reasons why
> it's taking decades to get from the historical point A (where
> computers talk to humans in at most one language at a time, and don't
> talk to each other at all) to that desired point B.
>
> We'll know we're done with that transition when Unicode becomes almost
> transparently invisible, and the vast majority of programmers are once
> again able to just deal with "text" without worrying too much about
> how it's represented internally (but also having their programs be
> readily usable in languages other than their own).
>
> Python 3 is already a lot closer to that ideal than Python 2 was, but
> there are still some rough edges to iron out. The ones I'm personally
> aware of affecting 3.4+ (including the one Serhiy started this thread
> about) are listed as dependencies of http://bugs.python.org/issue22555

So, just last week I had to teach pbr how to deal with git commit
messages that are not utf8 decodable.
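To give a flavour of the kind of shim I mean, here is a minimal sketch.
It is not the actual pbr change; `decode_commit_message` is a name made
up for this example, and the lossy fallback is just one possible policy:

```python
def decode_commit_message(raw: bytes) -> str:
    """Decode a commit message, tolerating bytes that are not valid UTF-8."""
    try:
        # The common case: the message really is UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Lossy fallback (an assumption of this sketch): each undecodable
        # sequence is replaced with U+FFFD instead of raising.
        return raw.decode("utf-8", errors="replace")


good = decode_commit_message(b"caf\xc3\xa9")  # valid UTF-8
bad = decode_commit_message(b"caf\xe9")       # Latin-1 bytes, not UTF-8
```

The point is only that the caller has to pick a policy (fail, replace,
or escape) at the boundary, because git will not pick one for them.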

Some of the lowest layers of our stacks are willfully hostile to utf8:

 - Linux itself refuses to consider paths to be anything other than
   octet sequences
   [for various reasons, one of which is that it would be a backwards
   compatibility break to stop handling non-unicode strings, and Linux
   reallllllly doesn't want to do that, because you'd immediately make
   some % of data worldwide inaccessible].
 - libc is somewhat, but not a lot, better - it's constrained by Linux
 - git considers commit messages to be octet sequences, and file paths
   likewise
   [for much the same reason as Linux: existing repositories have the
   data in them, and rejecting it would be an API break]
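A small sketch of what "paths are octet sequences" looks like from
Python. It assumes a Linux filesystem that accepts arbitrary bytes in
file names (ext4, tmpfs and friends); `list_raw_names` and the file name
are made up for the illustration:

```python
import os
import tempfile


def list_raw_names(directory: bytes) -> list:
    """Return directory entries as raw bytes, undecoded."""
    # Passing a bytes path makes os.listdir() return bytes names.
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    bad_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
    # Assumption: the filesystem stores the name as given (true on Linux).
    with open(os.path.join(tmp_bytes, bad_name), "wb") as handle:
        handle.write(b"test data")
    names = list_raw_names(tmp_bytes)
```

The kernel hands back exactly the bytes it was given; nothing at that
layer knows or cares whether they are UTF-8.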

bzr refused non-unicode paths from day one, and we had a steady stream
of users reporting that they couldn't import their history into bzr.
One common reason is that they had test data in files on disk that was
deliberately non-unicode (e.g. they were testing unicode handling
boundary conditions in their software). Overall I believe we made the
right choice, because we had relatively little in the way of headaches
on Windows and MacOSX. [The most we ran into was the case insanity,
plus normalisation forms on MacOSX].

Surrogate escaping is a clever hack, and while the underlying layers
are staunchly willing to give us crap data, we have a fairly simple
choice:
 - either accept that under some circumstances folk will have to do
   their own interop shim at the boundary, or
 - do the surrogate escaping hack to centralise the interop shims.
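For anyone who hasn't looked at the hack itself, a minimal sketch using
the standard `surrogateescape` error handler (the byte string is just
sample data):

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Each undecodable byte 0xNN is smuggled through as the lone
# surrogate U+DCNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Ordinary str operations still work on the escaped text.
is_txt = text.endswith(".txt")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

So the data survives a trip through `str` untouched, as long as both
ends of the trip use the same handler.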

The big risk, as already pointed out, is that the interop shims can at
most get you mojibake rather than a crash. This isn't a win; it's not
even beneficial.
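One way that can play out, as I read the risk (a sketch with made-up
sample data, not a report of a real bug): escaped bytes get mixed with
properly decoded text and written back out.

```python
latin1_name = b"caf\xe9"  # Latin-1 bytes that slipped in from outside
escaped = latin1_name.decode("utf-8", errors="surrogateescape")

# Real text and escaped bytes are combined without any error.
label = "r\u00e9sum\u00e9: " + escaped
mixed = label.encode("utf-8", errors="surrogateescape")

# The output is UTF-8 for the label plus one raw Latin-1 byte,
# so it is no longer valid in either encoding as a whole.
try:
    mixed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Read as Latin-1, the name is right but the label is garbage.
as_latin1 = mixed.decode("latin-1")
```

Nothing raised at the point where the bad data entered; the damage only
shows up later, in someone else's decoder.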

I am not at all convinced by the distributor and packaging migration
to Python3 argument. They have 'python3 -u' available for writing
utilities that may be given mojibake input *and be expected to work
regardless*. That lets Python3 get up and started and they can choose
their own approach to handling the awful: they can just work in
bytestrings, never decoding; they can explicitly decode with
surrogateescape; they can write their own tooling.
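The "just work in bytestrings" option might look something like this
sketch; `matching_lines` is a hypothetical helper, and the `io.BytesIO`
object stands in for a binary stream such as `sys.stdin.buffer`:

```python
import io


def matching_lines(stream, needle: bytes):
    """Yield the lines of a binary stream that contain needle, undecoded."""
    for line in stream:
        if needle in line:
            yield line


# Sample input with a byte that is not valid UTF-8.
sample = io.BytesIO(b"fix \xff bug\nunrelated change\n")
hits = list(matching_lines(sample, b"fix"))
```

In a real utility the same function could be fed `sys.stdin.buffer` and
its results written to `sys.stdout.buffer`, so no decoding ever happens
and nothing can go wrong with it.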

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud