[Python-ideas] Py3k invalid unicode idea

Tue Oct 7 14:07:29 CEST 2008

I don't know an awful lot about unicode, so rather than clog up the already 
lengthy threads on the 3k list, I figured I'd just toss this idea out over 
here.

As I understand it, there is a fairly major issue regarding malformed unicode 
data being passed to python, particularly on startup and for filenames.  This 
has led to much discussion and ultimately the decision (?) to mirror a 
variety of OS functions to work with both bytes and unicode.  Obviously this 
puts us on a slope of questionable friction toward reverting to 2.x, where 
unicode wasn't "core".

My thought is this:  When passed invalid unicode, keep it invalid.  This is 
largely similar to the UTF-8b ideas that were being tossed around, but a tad 
different.  The idea would be to maintain invalid byte sequences by use of 
the private use area in the unicode spec, but be explicit about this 
conversion to the program.

In particular, I'm suggesting the addition of the following (I'll 
use "surrogate" to refer to the invalid bytes in a unicode string):

1) Encoding 'raw'.  Decoding from raw forces every byte to be converted to a 
surrogate value.  Encoding to raw converts the surrogates back to bytes, and 
gives an error on valid unicode characters(!).  This would enable 
applications to effectively interface with the system using bytes (by setting 
default encoding or the like), but not require any API changes to actually 
support the bytes type.

2) Error handler 'force' (or whatever).  For decoding, when an invalid byte is 
encountered, replace with a surrogate.  For encoding, write the invalid byte.

2a) Decoding invalid unicode or encoding a string with surrogates raises a 
UnicodeError (unless handler 'force' is specified or encoding is 'raw').

3) string method 'valid'.  'valid()' would return False if the string contains 
at least one surrogate and True otherwise.  This would allow programs to 
check if the string is correct, and handle it if not.  This would be of 
particular value when reading boot information like sys.argv as that would 
use the 'force' error handler in order to prevent boot failure.
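
To make points 1 through 3 a bit more concrete, here is a rough sketch in 
Python 3.  It is only an illustration under assumptions of my own: the base 
code point U+F700 is an arbitrary pick from the private use area, and the 
names PUA_BASE, is_escaped, valid, encode_text, raw_decode and raw_encode are 
all hypothetical.  The decoding half hooks in through the existing 
codecs.register_error.  The encoding half is a plain function, because a 
private use character is itself encodable, so an error handler would never 
fire for it.

```python
import codecs

# Assumption: byte 0xNN is stored as the code point U+F700 + 0xNN.
PUA_BASE = 0xF700


def _force_errors(exc):
    """Decoding handler: replace each undecodable byte with a stand-in."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("force", _force_errors)


def is_escaped(ch):
    """True if ch is one of the stand-ins for an invalid byte."""
    return PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF


def valid(text):
    """Point 3: False if the string contains at least one stand-in."""
    return not any(is_escaped(ch) for ch in text)


def encode_text(text, encoding="utf-8", force=False):
    """Points 2 and 2a: refuse stand-ins unless force is requested.

    Encodes one character at a time, so this assumes a stateless,
    BOM-less encoding such as UTF-8.
    """
    if not force:
        if not valid(text):
            raise UnicodeError("string contains escaped invalid bytes")
        return text.encode(encoding)
    out = bytearray()
    for ch in text:
        if is_escaped(ch):
            out.append(ord(ch) - PUA_BASE)  # write the original byte back
        else:
            out += ch.encode(encoding)
    return bytes(out)


def raw_decode(data):
    """Point 1: every byte becomes a stand-in, valid or not."""
    return "".join(chr(PUA_BASE + b) for b in data)


def raw_encode(text):
    """Point 1: only stand-ins are accepted; real characters are an error."""
    if not all(is_escaped(ch) for ch in text):
        raise UnicodeError("'raw' accepts only escaped bytes")
    return bytes(ord(ch) - PUA_BASE for ch in text)


name = b"caf\xe9.txt".decode("utf-8", "force")  # Latin-1 name, not UTF-8
print(valid(name))                     # False
print(encode_text(name, force=True))   # b'caf\xe9.txt'
```

One known weakness of this particular mapping: a genuine private use 
character in the same range is indistinguishable from a stand-in, so the 
range would have to be reserved.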

How the invalid bytes would be stored internally is certainly a matter of hot 
debate on the 3k list.  As I mentioned before, I am not intimately familiar 
with unicode, so I don't have much to suggest.  If I had to implement it 
myself now, I'd probably use a piece of the private use area as an escape 
(much like '\\' does).
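
For what it's worth, the escape idea could look roughly like the sketch 
below.  Everything in it is an assumption for illustration: U+F8FF as the 
escape character, the names ESC, escape_decode and escape_encode, and the 
rule that an invalid byte is stored as ESC followed by the character whose 
code point equals the byte, while a literal ESC in the input is doubled 
(just as '\\' stands for a backslash).

```python
# Hypothetical escape character taken from the private use area.
ESC = "\uf8ff"


def escape_decode(data, encoding="utf-8"):
    """Decode bytes, storing each undecodable byte as ESC + chr(byte)."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            # The rest decodes cleanly: double any literal ESC and stop.
            out.append(data[pos:].decode(encoding).replace(ESC, ESC + ESC))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice data[pos:].
            good = data[pos:pos + exc.start].decode(encoding)
            out.append(good.replace(ESC, ESC + ESC))
            for b in data[pos + exc.start:pos + exc.end]:
                out.append(ESC + chr(b))
            pos += exc.end
    return "".join(out)


def escape_encode(text, encoding="utf-8"):
    """Inverse of escape_decode; assumes a stateless encoding like UTF-8."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch != ESC:
            out += ch.encode(encoding)
            continue
        nxt = next(chars, None)
        if nxt is None:
            raise UnicodeError("dangling escape character")
        if nxt == ESC:
            out += ESC.encode(encoding)   # doubled ESC is a literal ESC
        elif ord(nxt) < 256:
            out.append(ord(nxt))          # restore the original byte
        else:
            raise UnicodeError("invalid escape sequence")
    return bytes(out)


raw = b"a\xff" + ESC.encode("utf-8") + b"b"
text = escape_decode(raw)
print(escape_encode(text) == raw)   # True
```

Unlike a fixed block of stand-in characters, this keeps every input 
unambiguous, at the cost of strings whose length no longer matches what the 
user sees.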

Finally, there seems to be much concern about internal invalid unicode 
wreaking havoc when tossed to external programs/libraries.  I have to say 
that I don't really see what the problem is, because whenever python writes 
unicode, oughtn't it be buffered by "encode"?  In that case you'd either get 
an error or would be explicitly allowing invalid strings (via 'raw' 
or 'force').  And besides, if python has to deal with bad unicode, these 
libraries should have to too ;).

Even more finally, let me apologize in advance if I missed something on 
another list or this is otherwise too redundant.