[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
glyph at divmod.com
Thu Apr 23 00:49:30 CEST 2009
On 07:17 pm, martin at v.loewis.de wrote:
>>-1. On UNIX, character data is not sufficient to represent paths. We
>>must, must, must continue to have a simple bytes interface to these
>>APIs.
>I'd like to respond to this concern in three ways:
>
>1. The PEP doesn't remove any of the existing interfaces. So if the
> interfaces for byte-oriented file names in 3.0 work fine for you,
> feel free to continue to use them.
It's good to know this. It would be good if the PEP made it clear that
it is proposing an additional way to work with undecodable bytes, not
replacing the existing one.
For me, this PEP isn't an acceptable substitute for direct bytes-based
access to command-line arguments and environment variables on UNIX. To
my knowledge, *those* APIs don't exist yet. I would like it if this PEP
were not used as an excuse to avoid adding them.
>2. Even if they were taken away (which the PEP does not propose to do),
> it would be easy to emulate them for applications that want them.
I think this is a pretty clear abstraction inversion. Luckily nobody is
proposing it :).
>3. I still disagree that we must, must, must continue to provide these
> interfaces.
You do have a point; if there is a clean, defined mapping between str
and bytes in terms of all path/argv/environ APIs, then we don't *need*
those APIs, since we can just implement them in terms of characters.
But I still think that's a bad idea, since mixing the returned strings
with *other* APIs remains problematic. And regardless, the particular
mapping you propose is problematic...
> I don't understand from the rest of your message what
> would *actually* break if people would use the proposed interfaces.
As for more concrete problems: the utf-8 codec currently in Python 2.5,
2.6, and 3.0 will happily encode half-surrogates, at least in the builds
I have:
>>> '\udc81'.encode('utf-8').decode('utf-8')
'\udc81'
So there's an ambiguity when passing U+DC81 to this codec: do you mean
\xed\xb2\x81 or do you just mean \x81? Of course it would be possible
to make UTF-8B consistent in this regard, but it is still going to
interact with code that thinks in terms of actual UTF-8, and the failure
mode here is very difficult to inspect.
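To make the ambiguity concrete, here is a sketch using the
surrogateescape handler as the PEP's UTF-8B mapping (run against a
current CPython, whose strict utf-8 codec, unlike the 2.5/2.6/3.0 builds
above, refuses lone surrogates):

```python
# Under the PEP's mapping, U+DC81 stands for the raw byte 0x81:
escaped = b'\x81'.decode('utf-8', 'surrogateescape')
assert escaped == '\udc81'
assert escaped.encode('utf-8', 'surrogateescape') == b'\x81'

# Code that thinks in terms of actual UTF-8 would instead expect the
# three-byte sequence \xed\xb2\x81 for U+DC81 -- or, as the strict
# codec does here, reject the half-surrogate altogether:
try:
    escaped.encode('utf-8')
except UnicodeEncodeError:
    pass  # lone surrogates are not valid UTF-8
```

So the same code point means different bytes depending on which codec
touches it, which is exactly the hard-to-inspect failure mode.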
A major problem here is that it's very difficult to puzzle out whether
anything *will* actually break. I might be wrong about the above because
of some subtlety of Unicode that I don't quite understand, but I don't want
to spend all day experimenting with every possible set of build options,
python versions, and unicode specifications. Neither, I wager, do most
people who want to call listdir().
Another specific problem: looking at the Character Map application on my
desktop, U+F0126 and U+F0127 are considered printable characters. I'm
not sure what they're supposed to be, exactly, but there are glyphs
there. This is running Ubuntu 8.04; there may be more of these in use
in more recent versions of GNOME.
There is nothing "private" about the "private use" area; Python can
never use any of these characters for *anything*, except possibly
internally in ways which are never exposed to application code, because
the operating system (or window system, or libraries) might use them.
If I pass a string with those printable PUA/A characters in it to
listdir(), what happens? Do they get turned into bytes, do they only
get turned into bytes if my filesystem encoding happens to be something
other than UTF-8...?
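As far as I can tell (a quick sketch, assuming a UTF-8 filesystem
encoding and the PEP's surrogateescape handler), PUA code points are
ordinary characters to the codec machinery and round-trip untouched,
indistinguishable from any other text:

```python
# U+F0126 from Supplementary Private Use Area-A, as in Character Map:
name = '\U000F0126'

# The codec treats it as just another character: it encodes to real
# UTF-8 bytes and decodes back unchanged.
encoded = name.encode('utf-8', 'surrogateescape')
assert encoded == b'\xf3\xb0\x84\xa6'
assert encoded.decode('utf-8', 'surrogateescape') == name
```

Which is why any scheme that *also* smuggles bytes through PUA code
points can't tell smuggled bytes from a filename that genuinely
contained those characters.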
The PEP seems a bit ambiguous to me as far as how the PUA hack and the
half-surrogate hack interact. I could be wrong, but it seems to me to
be an either-or proposition, in which case there would be *four*
byte-carrying types in Python 3.1: bytes, bytearray, str-with-PUA/A-junk,
and str-with-half-surrogate-junk. Detecting the difference would be an
expensive and
subtle affair; the simplest solution I could think of would be to use an
error-prone regex. If the encoding hack used were simply NULL, then the
detection would be straightforward: "if '\u0000' in thingy:".
Ultimately I think I'm only -0 on all of this now, as long as we get
bytes versions of environ and argv. Even if these corner-case issues
aren't fixed, those of us who want to have correct handling of
undecodable filenames can do so.