[Python-ideas] Py3k invalid unicode idea
Dillon Collins
dillonco at comcast.net
Thu Oct 9 18:31:40 CEST 2008
On Thursday 09 October 2008, Stephen J. Turnbull wrote:
> Dillon Collins writes:
> > My thought is this: When passed invalid unicode, keep it invalid.
> > This is largely similar to the UTF-8b ideas that were being tossed
> > around, but a tad different. The idea would be to maintain invalid
> > byte sequences by use of the private use area in the unicode spec,
> > but be explicit about this conversion to the program.
>
> FWIW this has been suggested several times. There are two problems
> with it. The first is collisions with other private space users.
> Unlikely, but it will (eventually) happen. When it does, it will very
> likely result in data corruption, because those systems will assume
> that these are valid private sequences, not reencoded pollution.
I certainly do agree that assuming PUA codes will never be used is foolish.
As I suggested later on, you could use a PUA code as a sort of backslash
escape to preserve both the valid PUA code and the invalid data.
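To make the escape idea concrete, here's a rough sketch in Python. The
escape code point (U+F8FF), the base for re-encoded bytes (U+E000), and the
function name are arbitrary choices for illustration only, not part of the
proposal itself:

ESCAPE = '\uf8ff'      # hypothetical escape code point in the BMP PUA
PUA_BASE = 0xE000      # hypothetical base for re-encoded invalid bytes

def decode_keeping_invalid(data, encoding='utf-8'):
    out = []
    i = 0
    while i < len(data):
        # Try the longest decodable chunk starting at i (UTF-8 characters
        # are at most four bytes long).
        for j in range(min(len(data), i + 4), i, -1):
            try:
                chunk = data[i:j].decode(encoding)
            except UnicodeDecodeError:
                continue
            # Double any genuine escape character so it stays unambiguous.
            out.append(chunk.replace(ESCAPE, ESCAPE + ESCAPE))
            i = j
            break
        else:
            # Undecodable byte: re-encode it into the PUA, prefixed by
            # the escape code point.
            out.append(ESCAPE + chr(PUA_BASE + data[i]))
            i += 1
    return ''.join(out)

Since a genuine escape character is always doubled and a re-encoded byte is
always preceded by a single escape, the original byte stream can be
reconstructed without ambiguity.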
>
> The second problem is that internal data will leak to other libraries.
> There is no reason to suppose that those libraries will see reencoded
> forms, because the whole point of using the C interface is to work
> directly on the Python data structures. At that point, you do have
> corruption, because the original invalid data has been converted to
> valid Unicode.
Yes and no... While C programs generally work on Python's internal data
structures, they shouldn't (and basically don't) do so through direct access
to the PyObject struct. Instead, they use the various macros/functions
provided.
With my proposal, unicode strings would have a valid flag, and one could
easily modify PyUnicode_AS_UNICODE to return NULL (and set a UnicodeError) if the
string is invalid, and add a PyUnicode_AS_RAWUNICODE that wouldn't. Or you
could simply document that libraries need to call a PyUnicode_ISVALID to
determine whether or not the string contains invalid codes.
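To make that concrete, here's a toy Python-level model of the intended
semantics. Nothing below is a real CPython API; the names are illustrative
only, and the actual change would of course live in the C layer:

class ProposedStr(str):
    # A str that remembers whether it contains re-encoded invalid bytes.
    is_valid = True

def as_unicode(s):
    # Models the proposed PyUnicode_AS_UNICODE behaviour: refuse to hand
    # out data that still contains re-encoded invalid bytes.
    if isinstance(s, ProposedStr) and not s.is_valid:
        raise UnicodeError("string contains re-encoded invalid bytes")
    return s

def as_raw_unicode(s):
    # Models the proposed PyUnicode_AS_RAWUNICODE: hand the data out
    # regardless, leaving the caller to cope with any invalid codes.
    return str(s)

A library that only ever uses the non-RAW accessor would get an error
instead of silently consuming re-encoded data.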
>
> You write "And besides, if python has to deal with bad unicode, these
> libraries should have to too ;)." Which is precisely right. The bug
> in your idea is that they never will! Your proposal robs them of the
> chance to do it in their own way by buffering it through Python's
> cleanup process.
What makes this problem nasty all around is that your proposal has the same
bug: by not allowing invalid unicode internally, the only way to allow
programs to handle the (possible) problem is to always accept bytes, which
would put us most of the way back to a 2.x world. At least with my proposal,
libraries can opt to deal with the bad, albeit slightly sanitized, unicode if
they want to.
>
> AFAICS, there are two sane paths. Accept (and document!) that you
> will pass corruption to other parts of the system, and munge bad octet
> sequences into some kind of valid Unicode (eg, U+FFFD REPLACEMENT
> CHARACTER, or a PUA encoding of raw bytes). Second, signal an error
> on encountering an invalid octet sequence, and leave it up to the user
> program to handle it.
Well, the bulk of my proposal was to let the program choose which of those
(three!) options it wants. I fail to see the benefit of forcing its hand,
especially since the API already supports this through the use of both codecs
and error handlers. It just seems like a more elegant solution to me.
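For instance, all three behaviours can already be selected per call with the
existing codec machinery. The 'pua' error handler below is not a real
handler; I'm registering it here purely to illustrate the third option:

import codecs

def pua_errors(exc):
    # Re-encode each undecodable byte into the private use area
    # (U+E000 + byte value is an arbitrary choice for this sketch).
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(0xE000 + b) for b in bad), exc.end
    raise exc

codecs.register_error('pua', pua_errors)

data = b'abc\xffdef'

try:
    data.decode('utf-8', 'strict')          # option 1: signal an error
except UnicodeDecodeError as e:
    print('strict:', e)
print('replace:', data.decode('utf-8', 'replace'))  # option 2: munge to U+FFFD
print('pua:', data.decode('utf-8', 'pua'))          # option 3: keep the bytes, re-encoded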