[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
glyph at divmod.com
Wed Apr 22 14:20:24 CEST 2009
On 06:50 am, martin at v.loewis.de wrote:
>I'm proposing the following PEP for inclusion into Python 3.1.
>Please comment.
>To convert non-decodable bytes, a new error handler "python-escape" is
>introduced, which decodes non-decodable bytes into a private-use
>character U+F01xx, which is believed to not conflict with private-use
>characters that currently exist in Python codecs.
-1. On UNIX, character data is not sufficient to represent paths. We
must, must, must continue to have a simple bytes interface to these
APIs. Covering it up in layers of obscure encoding hacks will not make
the problem go away, it will just make it harder to understand.
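
As a minimal sketch of what "a simple bytes interface" means in practice: passing a bytes path to os.listdir returns the entries as bytes, so no decoding is attempted. The helper name list_names_as_bytes and the sample filename are made up for illustration, and the demo assumes a filesystem that accepts arbitrary bytes in names (typical on Linux, not guaranteed elsewhere).

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return directory entries as raw bytes, with no decoding step."""
    # Passing a bytes path makes os.listdir return bytes entries.
    return sorted(os.listdir(os.fsencode(directory)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
        try:
            with open(os.path.join(os.fsencode(tmp), raw_name), "wb"):
                pass
        except OSError:
            # Some filesystems reject names that are not valid UTF-8.
            print("this filesystem refused the name")
        for name in list_names_as_bytes(tmp):
            print(repr(name))
```

The names come back exactly as the kernel stores them, so the caller decides if and when to decode.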
To make matters worse, Linux and GNOME use the PUA for some printable
characters. If you open up charmap on an Ubuntu system and select "view
by Unicode character block", then click on "private use area", you'll
see many of these. I know that Apple uses at least a few PUA codepoints
for the Apple logo and the propeller/option icons as well.
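
To illustrate the ambiguity, here is a rough sketch that looks for private-use characters by Unicode general category ("Co"). The helper name private_use_chars is hypothetical, and U+F0141 is just one arbitrary code point in the U+F01xx range from the quoted proposal. The check cannot tell a character produced by an escaping scheme from one a platform really uses, such as U+F8FF, which Apple fonts render as the Apple logo.

```python
import unicodedata


def private_use_chars(text):
    """Return the characters of text that fall in any Private Use Area."""
    # General category "Co" covers the BMP PUA and both supplementary PUAs.
    return [ch for ch in text if unicodedata.category(ch) == "Co"]


if __name__ == "__main__":
    genuine = "logo\uf8ff.png"           # a PUA code point used by a platform
    escaped = "caf" + chr(0xF0141) + ".txt"  # stand-in for an escaped byte
    for name in (genuine, escaped, "plain.txt"):
        print(ascii(name), [hex(ord(ch)) for ch in private_use_chars(name)])
```

Both kinds of name are flagged the same way, so application code has to know which PUA convention applies before it can interpret them.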
I am still -1 on any turn-non-decodable-bytes-into-text scheme, because it
makes life harder for those of us trying to keep bytes and text
straight, but if you absolutely must represent POSIX filenames as
mojibake rather than bytes, the only workable solution is to use NUL as
your escape character. That's the only code point which _actually_
can't show up in a filename somehow. As we discussed last time, this is
what Mono does with System.IO.Path. As a bonus, it's _much_ easier to
detect a NUL from random application code than to try to figure out if a
string has any half-surrogates or magic PUA characters which shouldn't
be interpreted according to platform PUA rules.
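
A sketch of what a NUL-based escape could look like, under loudly stated assumptions: the layout here (NUL followed by two hex digits per undecodable byte) is invented for illustration and is neither Mono's actual format nor anything in the PEP. The handler name "nul-escape-sketch" and the helpers decode_name, encode_name and has_escapes are likewise hypothetical.

```python
import codecs


def _nul_escape(exc):
    """Error handler: replace each undecodable byte with NUL + two hex digits."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\x00%02x" % b for b in bad), exc.end


codecs.register_error("nul-escape-sketch", _nul_escape)


def decode_name(raw, encoding="utf-8"):
    """Decode a filename, escaping undecodable bytes (hypothetical scheme)."""
    return raw.decode(encoding, "nul-escape-sketch")


def has_escapes(name):
    """A NUL cannot occur in a real POSIX filename, so it marks an escape."""
    return "\x00" in name


def encode_name(name, encoding="utf-8"):
    """Invert decode_name, restoring the original bytes."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "\x00":
            # The two characters after NUL are the hex value of the raw byte.
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            j = name.find("\x00", i)
            if j == -1:
                j = len(name)
            out += name[i:j].encode(encoding)
            i = j
    return bytes(out)
```

Detection from application code is then a single membership test, with no need to reason about surrogates or platform PUA rules.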