[Python-Dev] a suggestion ... Re: PEP 383 (again)
tmbdev at gmail.com
Tue Apr 28 21:01:58 CEST 2009
I think we should break up this problem into several parts:
(1) Should the default UTF-8 decoder fail if it gets an illegal byte
sequence?
It's probably OK for the default decoder to be lenient in some way (see
below).
(2) Should the default UTF-8 encoder for file system operations be allowed
to generate illegal byte sequences?
I think that's a definite no; if I set the encoding for a device to UTF-8, I
never want Python to try to write illegal UTF-8 strings to my device.
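For concreteness, here is a minimal sketch in today's Python (3.1+, where
PEP 383's surrogateescape handler exists) of the behavior at issue: the
surrogateescape encoder will happily write the illegal bytes back out,
while a strict encoder refuses.

```python
bad = b"\xff\xfehello"  # not valid UTF-8

# PEP 383 decodes the illegal bytes to lone surrogates ('\udcff\udcfe...'):
s = bad.decode("utf-8", "surrogateescape")

# ...and the matching encoder reproduces the illegal byte sequence,
# which is exactly what point (2) argues should never happen by default:
assert s.encode("utf-8", "surrogateescape") == bad

# A strict UTF-8 encoder rejects the lone surrogates instead:
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("strict encoder rejects the illegal data")
```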
(3) What kind of representation should the UTF-8 decoder return for illegal
byte sequences?
There are actually several choices: (a) it could guess what the actual
encoding is and use that, (b) it could return a valid unicode string that
indicates the illegal characters but does not re-encode to the original byte
sequence, or (c) it could return some kind of non-standard representation
that encodes back into the original byte sequence.
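In today's Python, the three choices can be illustrated with the standard
error handlers (the Latin-1 guess in (a) is just an example):

```python
data = b"caf\xe9"  # Latin-1 bytes; illegal as UTF-8

# (a) guess the actual encoding and use it (here: a Latin-1 guess):
guessed = data.decode("latin-1")                    # 'café'

# (b) a valid unicode string that marks the bad byte but does not
#     re-encode to the original byte sequence:
lossy = data.decode("utf-8", "replace")             # 'caf\ufffd'

# (c) a non-standard representation (lone surrogates, PEP 383) that
#     encodes back into the original bytes:
escaped = data.decode("utf-8", "surrogateescape")   # 'caf\udce9'
assert escaped.encode("utf-8", "surrogateescape") == data
```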
PEP 383 violated (2), and I think that's a bad thing.
I think the best solution would be to use (3a) and fall back to (3b) if that
doesn't work. If people try to write those strings, they will always get
written as correctly encoded UTF-8 strings.
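A sketch of that (3a)-then-(3b) policy, assuming a hypothetical guess list
of utf-8 then cp1252:

```python
def decode_guess(data: bytes) -> str:
    # (3a): try likely encodings in order (this guess list is hypothetical)
    for enc in ("utf-8", "cp1252"):
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            pass
    # (3b): a valid string that marks, but cannot restore, the bad bytes
    return data.decode("utf-8", "replace")
```

Either branch yields a valid unicode string, so writing the result back out
always produces correctly encoded UTF-8.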
If people really want the option of (3c), then I think encoders related to
the file system should by default reject those strings as illegal because
the potential problems from writing them are just too serious. Printing
routines and UI routines could display them without error (but with some
clear indication), of course.
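For instance, in today's Python a strict filesystem-bound encoder already
rejects such strings, while display code can render them with a visible
escape (backslashreplace) instead of failing:

```python
# A decoded name carrying an illegal byte as a lone surrogate:
name = b"caf\xe9".decode("utf-8", "surrogateescape")   # 'caf\udce9'

# Strict encoding for writing rejects it, as argued above:
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    print("rejected for writing")

# UI/printing code can still show it, with a clear indication:
shown = name.encode("ascii", "backslashreplace").decode("ascii")
assert shown == "caf\\udce9"
```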
There is yet another option, which is arguably the "right" one: make the
results of os.listdir() subclasses of string that keep track of where they
came from. If you write back to the same device, it just writes the same
byte sequence. But if you write to other devices and the byte sequence is
illegal according to its encoding, you get an error.
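A rough sketch of such a byte-tracking string (all names here are
hypothetical; the real os.listdir() returns plain strings):

```python
class FSName(str):
    """Hypothetical str subclass remembering its origin bytes and device."""

    def __new__(cls, raw: bytes, device: str, encoding: str):
        self = super().__new__(cls, raw.decode(encoding, "replace"))
        self._raw = raw
        self._device = device
        return self

    def encode_for(self, device: str, encoding: str) -> bytes:
        if device == self._device:
            return self._raw          # same device: write the same bytes
        # other device: error out if the bytes are illegal in its encoding
        self._raw.decode(encoding)    # raises UnicodeDecodeError if illegal
        return self._raw

name = FSName(b"caf\xe9", "disk0", "utf-8")
assert name.encode_for("disk0", "utf-8") == b"caf\xe9"    # same device: OK
assert name.encode_for("disk1", "latin-1") == b"caf\xe9"  # legal elsewhere
```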