[Python-ideas] Py3k invalid unicode idea
Dillon Collins
dillonco at comcast.net
Tue Oct 7 14:07:29 CEST 2008
I don't know an awful lot about unicode, so rather than clog up the already
lengthy threads on the 3k list, I figured I'd just toss this idea out over
here.
As I understand it, there is a fairly major issue regarding malformed unicode
data being passed to python, particularly on startup and for filenames. This
has led to much discussion and ultimately the decision (?) to mirror a
variety of OS functions to work with both bytes and unicode. Obviously this
puts us on a slippery slope back toward 2.x, where unicode wasn't "core".
My thought is this: When passed invalid unicode, keep it invalid. This is
largely similar to the UTF-8b ideas that were being tossed around, but a tad
different. The idea would be to maintain invalid byte sequences by use of
the private use area in the unicode spec, but be explicit about this
conversion to the program.
In particular, I'm suggesting the addition of the following (I'll
use "surrogate" to refer to the invalid bytes in a unicode string):
1) Encoding 'raw'. Decoding with 'raw' forces every byte to be converted to a
surrogate value. Encoding with 'raw' converts the surrogates back to bytes,
and gives an error on valid unicode characters(!). This would enable
applications to effectively interface with the system using bytes (by setting
default encoding or the like), but not require any API changes to actually
support the bytes type.
2) Error handler 'force' (or whatever). For decoding, when an invalid byte is
encountered, replace with a surrogate. For encoding, write the invalid byte.
2a) Decoding invalid unicode or encoding a string with surrogates raises a
UnicodeError (unless handler 'force' is specified or encoding is 'raw').
3) string method 'valid'. 'valid()' would return False if the string contains
at least one surrogate and True otherwise. This would allow programs to
check if the string is correct, and handle it if not. This would be of
particular value when reading startup information like sys.argv, as that
would use the 'force' error handler in order to prevent startup failure.
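To make the three pieces above a bit more concrete, here is a rough sketch in
plain Python. Everything in it is an assumption for illustration only: the
mapping of byte N to U+F700 + N in the private use area, the names
SURROGATE_BASE, force_decode, force_encode, raw_decode, raw_encode and
is_valid, and the restriction to UTF-8. Note that it makes no attempt to
tell a genuine U+F7xx character in valid input apart from an escaped byte.

```python
import codecs

# Hypothetical mapping: byte N is stored as the private use character
# U+F700 + N. The range is an arbitrary choice for this sketch.
SURROGATE_BASE = 0xF700


def _is_surrogate(ch):
    """True if ch is one of the 256 stand-ins for a raw byte."""
    return 0 <= ord(ch) - SURROGATE_BASE <= 0xFF


def _force_handler(exc):
    """Decoding half of 'force': turn each undecodable byte into a surrogate."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(SURROGATE_BASE + b) for b in bad), exc.end


codecs.register_error("force", _force_handler)


def force_decode(data):
    """Decode UTF-8, keeping invalid bytes as surrogates instead of failing."""
    return data.decode("utf-8", errors="force")


def force_encode(text):
    """Encode to UTF-8, writing each surrogate back as its original byte."""
    out = bytearray()
    for ch in text:
        if _is_surrogate(ch):
            out.append(ord(ch) - SURROGATE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def raw_decode(data):
    """'raw': every byte becomes a surrogate, valid or not."""
    return "".join(chr(SURROGATE_BASE + b) for b in data)


def raw_encode(text):
    """'raw': surrogates become bytes again; anything else is an error."""
    for ch in text:
        if not _is_surrogate(ch):
            raise UnicodeError("'raw' cannot encode %r" % ch)
    return bytes(ord(ch) - SURROGATE_BASE for ch in text)


def is_valid(text):
    """Stand-in for the proposed str.valid() method."""
    return not any(_is_surrogate(ch) for ch in text)
```

A program reading something like sys.argv could then call
force_decode(), check is_valid() on the result, and only fall back to
force_encode() when it needs the original bytes again.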
How the invalid bytes would be stored internally is certainly a matter of hot
debate on the 3k list. As I mentioned before, I am not intimately familiar
with unicode, so I don't have much to suggest. If I had to implement it
myself now, I'd probably use a piece of the private use area as an escape
(much like '\\' does).
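For what it's worth, the escape idea could look roughly like the sketch
below. The escape character U+F8FF and the names ESC, escape_decode and
escape_encode are made up for this example, and it only handles UTF-8. An
invalid byte is stored as ESC followed by the character with that byte's
value, and a genuine ESC in valid input is doubled, the same way a literal
backslash is.

```python
# Hypothetical escape character taken from the private use area.
ESC = "\uf8ff"


def escape_decode(data):
    """Decode UTF-8, storing each invalid byte as ESC + chr(byte)."""
    out = []
    pos = 0
    while pos < len(data):
        rest = data[pos:]
        try:
            # Everything left is valid: double any genuine ESC and stop.
            out.append(rest.decode("utf-8").replace(ESC, ESC + ESC))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into `rest`.
            good = rest[:exc.start].decode("utf-8")
            out.append(good.replace(ESC, ESC + ESC))
            out.extend(ESC + chr(b) for b in rest[exc.start:exc.end])
            pos += exc.end
    return "".join(out)


def escape_encode(text):
    """Inverse of escape_decode: write escaped bytes back out unchanged."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise UnicodeError("dangling escape at end of string")
        nxt = text[i + 1]
        if nxt == ESC:
            out += ESC.encode("utf-8")  # doubled escape is a literal ESC
        elif ord(nxt) <= 0xFF:
            out.append(ord(nxt))  # escaped raw byte
        else:
            raise UnicodeError("bad escape sequence at index %d" % i)
        i += 2
    return bytes(out)
```

The cost of doing it this way is that string lengths and indexes no longer
line up with what the user sees, since one bad byte takes two characters.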
Finally, there seems to be much concern about internal invalid unicode
wreaking havoc when tossed to external programs/libraries. I have to say
that I don't really see what the problem is, because whenever python writes
unicode, oughtn't it go through "encode" first? In that case you'd either get
an error or would be explicitly allowing invalid strings (via 'raw'
or 'force'). And besides, if python has to deal with bad unicode, these
libraries should have to too ;).
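As a rough illustration of that argument, the sketch below puts a single
encoding step in front of whatever the string is being written to. The
byte-to-character mapping (U+F700 + byte), the function names checked_encode
and write_out, and the force flag are all assumptions for this example, not
an existing API.

```python
import io

# Hypothetical mapping: byte N is stored as private use character U+F700 + N.
SURROGATE_BASE = 0xF700


def checked_encode(text, force=False):
    """Encode to UTF-8; refuse surrogates unless the caller opts in."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - SURROGATE_BASE
        if 0 <= offset <= 0xFF:
            if not force:
                raise UnicodeError("string contains an invalid byte stand-in")
            out.append(offset)  # explicitly allowed: write the raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def write_out(stream, text, force=False):
    """Everything leaving the program goes through checked_encode first."""
    stream.write(checked_encode(text, force=force))


if __name__ == "__main__":
    sink = io.BytesIO()  # stands in for an external program or library
    write_out(sink, "fine")
    try:
        write_out(sink, "bad" + chr(SURROGATE_BASE + 0xFF))
    except UnicodeError as err:
        print("refused:", err)
```

Either the write fails loudly, or the caller passed force=True and has
taken responsibility for handing invalid data to the other side.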
Even more finally, let me apologize in advance if I missed something on
another list or this is otherwise too redundant.