[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Thu Apr 23 00:49:30 CEST 2009

On 07:17 pm, martin at v.loewis.de wrote:
>>-1.  On UNIX, character data is not sufficient to represent paths.  We
>>must, must, must continue to have a simple bytes interface to these
>>APIs.

>I'd like to respond to this concern in three ways:
>
>1. The PEP doesn't remove any of the existing interfaces. So if the
>   interfaces for byte-oriented file names in 3.0 work fine for you,
>   feel free to continue to use them.

It's good to know this.  It would be good if the PEP made it clear that 
it is proposing an additional way to work with undecodable bytes, not 
replacing the existing one.

For me, this PEP isn't an acceptable substitute for direct bytes-based 
access to command-line arguments and environment variables on UNIX.  To 
my knowledge *those* APIs still don't exist yet.  I would like it if 
this PEP were not used as an excuse to avoid adding them.
>2. Even if they were taken away (which the PEP does not propose to do),
>   it would be easy to emulate them for applications that want them.

I think this is a pretty clear abstraction inversion.  Luckily nobody is 
proposing it :).
>3. I still disagree that we must, must, must continue to provide these
>   interfaces.

You do have a point; if there is a clean, defined mapping between str 
and bytes in terms of all path/argv/environ APIs, then we don't *need* 
those APIs, since we can just implement them in terms of characters. 
But I still think that's a bad idea, since mixing the returned strings 
with *other* APIs remains problematic.  However, I still think the 
mapping you propose is problematic...
>   I don't understand from the rest of your message what
>   would *actually* break if people would use the proposed interfaces.

As far as more concrete problems: the utf-8 codec currently in python 
2.5 and 2.6, and 3.0 will happily encode half-surrogates, at least in 
the builds I have.

    >>> '\udc81'.encode('utf-8').decode('utf-8')
    '\udc81'

So there's an ambiguity when passing U+DC81 to this codec: do you mean 
\xed\xb2\x81 or do you just mean \x81?  Of course it would be possible 
to make UTF-8B consistent in this regard, but it is still going to 
interact with code that thinks in terms of actual UTF-8, and the failure 
mode here is very difficult to inspect.

A major problem here is that it's very difficult to puzzle out whether 
anything *will* actually break.  I might be wrong about the above for 
some subtlety of unicode that I don't quite understand, but I don't want 
to spend all day experimenting with every possible set of build options, 
python versions, and unicode specifications.  Neither, I wager, do most 
people who want to call listdir().

Another specific problem: looking at the Character Map application on my 
desktop, U+F0126 and U+F0127 are considered printable characters.  I'm 
not sure what they're supposed to be, exactly, but there are glyphs 
there.  This is running Ubuntu 8.04; there may be more of these in use 
in more recent version of GNOME.

There is nothing "private" about the "private use" area; Python can 
never use any of these characters for *anything*, except possibly 
internally in ways which are never exposed to application code, because 
the operating system (or window system, or libraries) might use them. 
If I pass a string with those printable PUA/A characters in it to 
listdir(), what happens?  Do they get turned into bytes, do they only 
get turned into bytes if my filesystem encoding happens to be something 
other than UTF-8...?

The PEP seems a bit ambiguous to me as far as how the PUA hack and the 
half-surrogate hack interact.  I could be wrong, but it seems to me to 
be an either-or proposition, in which case there would be *four* bytes 
types in python 3.1: bytes, bytearray, str-with-PUA/A-junk, str-with- 
half-surrogate-junk.  Detecting the difference would be an expensive and 
subtle affair; the simplest solution I could think of would be to use an 
error-prone regex.  If the encoding hack used were simply NULL, then the 
detection would be straightforward: "if '\u0000' in thingy:".

Ultimately I think I'm only -0 on all of this now, as long as we get 
bytes versions of environ and argv.  Even if these corner-case issues 
aren't fixed, those of us who want to have correct handling of 
undecodable filenames can do so.