[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 22 14:20:24 CEST 2009

On 06:50 am, martin at v.loewis.de wrote:
>I'm proposing the following PEP for inclusion into Python 3.1.
>Please comment.

>To convert non-decodable bytes, a new error handler "python-escape" is
>introduced, which decodes non-decodable bytes into a private-use
>character U+F01xx, which is believed to not conflict with private-use
>characters that currently exist in Python codecs.

-1.  On UNIX, character data is not sufficient to represent paths.  We 
must, must, must continue to have a simple bytes interface to these 
APIs.  Covering it up in layers of obscure encoding hacks will not make 
the problem go away, it will just make it harder to understand.
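
To illustrate what I mean by a simple bytes interface, here is a minimal 
sketch, assuming an interpreter where os.listdir() returns bytes when it 
is given a bytes argument.  The helper names list_raw_names and 
open_every_entry are made up for this example and are not part of any 
proposal.

```python
import os
import tempfile


def list_raw_names(directory: bytes) -> list:
    """Return the entries of `directory` as bytes, exactly as the OS reports them."""
    # A bytes argument makes os.listdir return bytes, so no decoding happens.
    return sorted(os.listdir(directory))


def open_every_entry(directory: bytes) -> int:
    """Open each regular file by its raw name; return how many were opened."""
    opened = 0
    for name in list_raw_names(directory):
        path = os.path.join(directory, name)  # bytes joined with bytes stays bytes
        if os.path.isfile(path):
            with open(path, "rb"):
                opened += 1
    return opened


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = os.fsencode(tmp)
        # Create one file using a bytes path, then list the directory as bytes.
        with open(os.path.join(raw_dir, b"notes.txt"), "wb"):
            pass
        print(list_raw_names(raw_dir))
```

Nothing here ever asks what encoding a name is in, so a name that is not 
valid in the locale's encoding is handled the same way as any other.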

To make matters worse, Linux and GNOME use the PUA for some printable 
characters.  If you open up charmap on an Ubuntu system and select "view 
by unicode character block", then click on "private use area", you'll 
see many of these.  I know that Apple uses at least a few PUA codepoints 
for the Apple logo and the propeller/option icons as well.

I am still -1 on any turn-non-decodable-bytes-into-text scheme, because it 
makes life harder for those of us trying to keep bytes and text 
straight, but if you absolutely must represent POSIX filenames as 
mojibake rather than bytes, the only workable solution is to use NUL as 
your escape character.  That's the only code point which _actually_ 
can't show up in a filename somehow.  As we discussed last time, this is 
what Mono does with System.IO.Path.  As a bonus, it's _much_ easier to 
detect a NUL from random application code than to try to figure out if a 
string has any half-surrogates or magic PUA characters which shouldn't 
be interpreted according to platform PUA rules.
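
A rough sketch of what NUL escaping could look like, to show why 
detection is cheap.  Everything about the format is an assumption made 
for this example: each undecodable byte becomes NUL followed by two hex 
digits, and the handler name "nul-escape-demo" and the helper functions 
are hypothetical.  It is not a description of Mono's actual scheme or of 
the handler proposed in the PEP.

```python
import codecs


def _nul_escape(exc):
    """Decode error handler: replace each undecodable byte with NUL + 2 hex digits."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\x00%02x" % b for b in bad), exc.end


# Hypothetical handler name, registered only for this illustration.
codecs.register_error("nul-escape-demo", _nul_escape)


def decode_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a filename, escaping any undecodable bytes."""
    return raw.decode(encoding, "nul-escape-demo")


def encode_name(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse decode_name, restoring the original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\x00":
            # NUL marks an escape: the next two characters are the byte in hex.
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            j = text.find("\x00", i)
            if j == -1:
                j = len(text)
            out += text[i:j].encode(encoding)
            i = j
    return bytes(out)


def has_escapes(text: str) -> bool:
    """A real POSIX filename cannot contain NUL, so one check is enough."""
    return "\x00" in text
```

Because the original bytes can never contain NUL, the escape marker is 
unambiguous, and application code that wants to know whether a name is 
"really" text only has to call something like has_escapes().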