[Python-ideas] Py3k invalid unicode idea

Thu Oct 9 11:11:42 CEST 2008

Dillon Collins writes:

 > My thought is this: When passed invalid unicode, keep it invalid.
 > This is largely similar to the UTF-8b ideas that were being tossed
 > around, but a tad different.  The idea would be to maintain invalid
 > byte sequences by use of the private use area in the unicode spec,
 > but be explicit about this conversion to the program.

FWIW this has been suggested several times.  There are two problems
with it.  The first is collisions with other private space users.
Unlikely, but it will (eventually) happen.  When it does, it will very
likely result in data corruption, because those systems will assume
that these are valid private sequences, not reencoded pollution.

One way to avoid this would be a runtime configuration option for
where the private encoding space starts.  That still won't avoid
collisions completely because some applications don't know or care
what is valid, and therefore might pass you anything.  But mostly it
should win because people who are assigning semantics to private space
characters will need to know what characters they're using, and the
others will rarely be able to detect corruption anyway.
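
To make the collision concrete, here is a rough sketch of such a
scheme with a configurable base.  Everything in it is an assumption
for illustration: the names decode_keep_invalid and encode_restore,
the default base U+F700, and the one-code-point-per-bad-byte mapping
are mine, not part of the proposal as posted.

```python
import codecs

DEFAULT_PUA_BASE = 0xF700  # assumed default; any 256-code-point PUA block would do


def _register_handler(base):
    """Register (or re-register) a decode error handler for this base."""
    name = f"pua-escape-{base:04x}"  # hypothetical handler name

    def handler(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        # One private use code point per undecodable byte.
        return "".join(chr(base + b) for b in bad), exc.end

    codecs.register_error(name, handler)
    return name


def decode_keep_invalid(data, base=DEFAULT_PUA_BASE):
    """Decode UTF-8, escaping invalid bytes into the PUA block at base."""
    return data.decode("utf-8", errors=_register_handler(base))


def encode_restore(text, base=DEFAULT_PUA_BASE):
    """Reverse the escaping: PUA code points in the block become raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x100:
            out.append(cp - base)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


# Invalid input round-trips as intended.
assert encode_restore(decode_keep_invalid(b"ab\xffcd")) == b"ab\xffcd"

# But a genuine U+F7FF from another private space user does not:
genuine = "\uf7ff".encode("utf-8")
assert encode_restore(decode_keep_invalid(genuine)) != genuine

# Moving the base away from that user's range avoids this collision.
assert encode_restore(decode_keep_invalid(genuine, base=0xE000),
                      base=0xE000) == genuine
```

Moving the base only helps if you know which ranges the other party
uses, which is the limitation described above.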

The second problem is that internal data will leak to other libraries.
There is no reason to suppose that those libraries will see reencoded
forms, because the whole point of using the C interface is to work
directly on the Python data structures.  At that point, you do have
corruption, because the original invalid data has been converted to
valid Unicode.
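
A small illustration of the leak, under the same assumed mapping (bad
byte 0xFF stored as U+F7FF).  The function naive_library_view is a
hypothetical stand-in for a C extension that reads the string data
directly and knows nothing about the escaping convention.

```python
original = b"ab\xff"                 # not valid UTF-8
escaped = "ab" + chr(0xF700 + 0xFF)  # what the assumed PUA scheme would store


def naive_library_view(text):
    """Hypothetical library: treats the string as ordinary valid Unicode."""
    return text.encode("utf-8")


seen = naive_library_view(escaped)

# The library gets perfectly valid UTF-8, so it has nothing to complain
# about, and the bytes it sees are not the bytes that came in.
seen.decode("utf-8")      # succeeds, no error raised
assert seen != original
```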

You write "And besides, if python has to deal with bad unicode, these
libraries should have to too ;)."  Which is precisely right.  The bug
in your idea is that they never will!  Your proposal robs them of the
chance to do it in their own way by buffering it through Python's
cleanup process.

AFAICS, there are two sane paths.  First, accept (and document!) that
you will pass corruption to other parts of the system, and munge bad
octet sequences into some kind of valid Unicode (eg, U+FFFD REPLACEMENT
CHARACTER, or a PUA encoding of raw bytes).  Second, signal an error
on encountering an invalid octet sequence, and leave it up to the user
program to handle it.
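
For what it's worth, both paths can be sketched with the standard
"replace" and "strict" error handlers of bytes.decode.  The wrapper
names decode_lossy and decode_strict are made up for this example.

```python
def decode_lossy(data):
    """Path 1: munge bad octets into U+FFFD and accept the data loss."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data):
    """Path 2: raise UnicodeDecodeError and let the caller decide."""
    return data.decode("utf-8", errors="strict")


bad = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

print(repr(decode_lossy(bad)))  # the trailing byte becomes U+FFFD

try:
    decode_strict(bad)
except UnicodeDecodeError as exc:
    # The user program sees exactly which octets were rejected.
    print("rejected bytes:", exc.object[exc.start:exc.end])
```

The first path is lossy and needs documenting; the second keeps the
original octets available to whoever catches the exception.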