[Python-Dev] PEP 383 (again)
tmbdev at gmail.com
Tue Apr 28 08:29:23 CEST 2009
I thought PEP-383 was a fairly neat approach, but after thinking about it, I
now think that it is wrong.
PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode strings in
a reversible way. But how do those non-UTF-8 byte sequences get into those
path names in the first place? Most likely because an encoding other than
UTF-8 was used to write the file system, but you're now trying to interpret
its path names as UTF-8.
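For concreteness, here is a minimal sketch of how that situation arises: a name written under a Latin-1 locale contains a byte that is not valid UTF-8, so a strict UTF-8 decode of the same bytes fails (the file name here is just an illustration):

```python
# A file name written under a Latin-1 (ISO 8859-1) locale:
name_bytes = "café".encode("iso-8859-1")   # b'caf\xe9'

# Later, a program assumes the file system is UTF-8 and decodes
# the same bytes strictly -- 0xe9 is not valid UTF-8 here.
try:
    name_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)
```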
Quietly escaping a bad UTF-8 encoding with private Unicode characters is
unlikely to be the right thing, since using the wrong encoding likely means
that other characters are decoded incorrectly as well. As a result, the
path name may fail in string comparisons and pattern matching, and will look
wrong to the user in print statements and dialog boxes. Therefore, when
Python encounters path names on a file system that are not consistent with
the (assumed) encoding for that file system, Python should raise an error.
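The failure modes above can be demonstrated with the PEP's proposed escaping, which later Pythons expose as the "surrogateescape" error handler (a sketch, using that handler as a stand-in for the PEP's "utf-8b"):

```python
raw = "café".encode("iso-8859-1")          # b'caf\xe9', not valid UTF-8

# PEP 383-style escaping: the bad byte becomes a lone surrogate.
escaped = raw.decode("utf-8", "surrogateescape")   # 'caf\udce9'

# Round-tripping back to bytes works...
assert escaped.encode("utf-8", "surrogateescape") == raw

# ...but the string no longer compares equal to what the user meant,
assert escaped != "café"

# and it cannot even be re-encoded as strict UTF-8 for display.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError:
    print("cannot encode the escaped name for display")
```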
If you really don't care what the string looks like and you just want an
encoding that round-trips without loss, you can probably just set your
encoding to one of the 8-bit encodings, like ISO 8859-15. Decoding
arbitrary byte sequences to Unicode strings as ISO 8859-15 is no less
correct than decoding them as the proposed "utf-8b". In fact, the most
likely source of non-UTF-8 sequences is the ISO 8859 family of encodings.
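As a quick sanity check on that claim: every byte value decodes under ISO 8859-15, and encoding the result reproduces the original bytes exactly, so the round-trip is lossless:

```python
data = bytes(range(256))                    # every possible byte value
text = data.decode("iso-8859-15")           # never raises
assert text.encode("iso-8859-15") == data   # lossless round-trip
print("all 256 byte values round-trip")
```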
As for what the byte-oriented interfaces should do, they are simply platform
dependent. On UNIX, they should do the obvious thing. On Windows, they can
either hook up to the low-level byte-oriented system calls that the system
supplies, or Windows could fake it and have the byte-oriented interfaces
always use UTF-8 and reject non-UTF-8 sequences as illegal (many byte
sequences are already illegal in file names anyway).
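The "reject non-UTF-8 sequences as illegal" option amounts to a strict validity check before handing the name to the native API. A hypothetical sketch (the function name is mine, not a proposed interface):

```python
def fsname_from_bytes(raw: bytes) -> str:
    """Hypothetical byte-oriented entry point: accept only strict
    UTF-8 and reject everything else, just as many byte sequences
    are already rejected as illegal file names."""
    try:
        return raw.decode("utf-8")          # strict decoding
    except UnicodeDecodeError:
        raise ValueError(f"illegal (non-UTF-8) file name: {raw!r}")

print(fsname_from_bytes(b"ok.txt"))         # accepted
try:
    fsname_from_bytes(b"caf\xe9")           # rejected
except ValueError as e:
    print(e)
```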