[Python-ideas] Py3k invalid unicode idea

Dillon Collins dillonco at comcast.net
Thu Oct 9 18:31:40 CEST 2008


On Thursday 09 October 2008, Stephen J. Turnbull wrote:
> Dillon Collins writes:
>  > My thought is this: When passed invalid unicode, keep it invalid.
>  > This is largely similar to the UTF-8b ideas that were being tossed
>  > around, but a tad different.  The idea would be to maintain invalid
>  > byte sequences by use of the private use area in the unicode spec,
>  > but be explicit about this conversion to the program.
>
> FWIW this has been suggested several times.  There are two problems
> with it.  The first is collisions with other private space users.
> Unlikely, but it will (eventually) happen.  When it does, it will very
> likely result in data corruption, because those systems will assume
> that these are valid private sequences, not reencoded pollution.

I certainly do agree that assuming PUA codes will never be used is foolish.  
As I suggested later on, you could use a PUA code as a sort of backslash 
escape to preserve both the valid PUA code and the invalid data.
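
For concreteness, here's a rough Python sketch of that escaping idea.  The 
particular codepoints (an escape marker and a block for the smuggled bytes) 
are hypothetical picks from the PUA, not anything specified anywhere:

    ESCAPE = '\uf8ff'     # hypothetical PUA codepoint used as the escape marker
    BYTE_BASE = 0xf700    # hypothetical PUA block for smuggled raw bytes

    def escape_decode(data):
        """Decode UTF-8, keeping each undecodable byte as ESCAPE + a PUA code,
        and doubling any genuine occurrence of ESCAPE (like backslash-backslash)."""
        out = []
        i = 0
        while i < len(data):
            # try the longest decodable chunk (UTF-8 sequences are at most 4 bytes)
            for j in range(min(len(data), i + 4), i, -1):
                try:
                    chunk = data[i:j].decode('utf-8')
                except UnicodeDecodeError:
                    continue
                out.append(chunk.replace(ESCAPE, ESCAPE * 2))
                i = j
                break
            else:
                out.append(ESCAPE + chr(BYTE_BASE + data[i]))  # preserve the bad byte
                i += 1
        return ''.join(out)

    # b'\xef\xa3\xbf' is a genuine U+F8FF in the input; it comes out doubled.
    print(repr(escape_decode(b'ok \xff\xfe \xef\xa3\xbf')))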

>
> The second problem is that internal data will leak to other libraries.
> There is no reason to suppose that those libraries will see reencoded
> forms, because the whole point of using the C interface is to work
> directly on the Python data structures.  At that point, you do have
> corruption, because the original invalid data has been converted to
> valid Unicode.

Yes and no...  While C programs generally work on Python's internal data 
structures, they shouldn't (and basically don't) do so through direct access 
to the PyObject struct.  Instead, they use the various macros/functions 
provided.

With my proposal, unicode strings would carry a valid flag, and one could 
easily modify PyUnicode_AS_UNICODE to return NULL (and set a UnicodeError) if 
the string is invalid, and add a PyUnicode_AS_RAWUNICODE that wouldn't.  Or 
you could simply document that libraries need to call a PyUnicode_ISVALID 
check to determine whether or not the string contains invalid codes.
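
Here's a toy Python-level model of the C-level behaviour I have in mind; the 
flag and the method names are hypothetical stand-ins for the macros above:

    class ProposedUnicode:
        """Toy model of the proposal: the string carries a valid flag; the
        ordinary accessor refuses invalid strings, the raw accessor does not."""

        def __init__(self, text, valid=True):
            self._text = text
            self._valid = valid

        def as_unicode(self):       # models PyUnicode_AS_UNICODE under the proposal
            if not self._valid:
                raise UnicodeError('string contains escaped invalid bytes')
            return self._text

        def as_rawunicode(self):    # models the proposed PyUnicode_AS_RAWUNICODE
            return self._text

        def isvalid(self):          # models the proposed PyUnicode_ISVALID
            return self._valid

    s = ProposedUnicode('ok \uf7ff', valid=False)
    print(s.isvalid())              # False
    print(repr(s.as_rawunicode()))  # works regardless of the flag
    s.as_unicode()                  # raises UnicodeError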

>
> You write "And besides, if python has to deal with bad unicode, these
> libraries should have to too ;)."  Which is precisely right.  The bug
> in your idea is that they never will!  Your proposal robs them of the
> chance to do it in their own way by buffering it through Python's
> cleanup process.

What makes this problem nasty all around is that your proposal has the same 
bug: by not allowing invalid unicode internally, the only way to let programs 
handle the (possible) problem is to always accept bytes, which would put us 
most of the way back to a 2.x world.  At least with my proposal, libraries 
can opt to deal with the bad, albeit slightly sanitized, unicode if they want 
to.

>
> AFAICS, there are two sane paths.  Accept (and document!) that you
> will pass corruption to other parts of the system, and munge bad octet
> sequences into some kind of valid Unicode (eg, U+FFFD REPLACEMENT
> CHARACTER, or a PUA encoding of raw bytes).  Second, signal an error
> on encountering an invalid octet sequence, and leave it up to the user
> program to handle it.

Well, the bulk of my proposal was to allow the program to choose whichever of 
those (3!) options it wants.  I fail to see the benefit of forcing its hand, 
especially since the API already supports this choice through the use of both 
codecs and error handlers.  It just seems like a more elegant solution to me.
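
To make that concrete: the stock 'strict' and 'replace' handlers already 
cover the first two paths, and a custom handler registered with 
codecs.register_error could cover the PUA path.  The 'pua-escape' handler 
below is a hypothetical example, not something that exists today:

    import codecs

    BYTE_BASE = 0xf700   # hypothetical PUA block for smuggled raw bytes

    def pua_escape(exc):
        """Custom error handler: map each undecodable byte into the PUA."""
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(BYTE_BASE + b) for b in bad), exc.end

    codecs.register_error('pua-escape', pua_escape)

    data = b'ok \xff\xfe'

    # Option 1: refuse the data outright (the default, errors='strict').
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print('strict:', e)

    # Option 2: munge it into valid Unicode with U+FFFD.
    print('replace:', repr(data.decode('utf-8', 'replace')))

    # Option 3: preserve the raw bytes via the PUA, as discussed above.
    print('pua-escape:', repr(data.decode('utf-8', 'pua-escape')))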



