[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Thu Apr 30 18:39:52 CEST 2009

Cameron Simpson writes:
 > On 29Apr2009 22:14, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > | Baptiste Carvello writes:
 > |  > By contrast, if the new utf-8b codec would *supersede* the old one,
 > |  > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
 > |  > surrogates are unused). Thus ambiguity could be avoided.
 > | 
 > | Unfortunately, that's false.  [Because Python strings are
 > | intended to be used as containers for widechars which are to be
 > | interpreted as Unicode when that makes sense, but there's no
 > | restriction against nonsense code points, including in UCS-4
 > | Python.]

[...]

 > Wouldn't you then be bypassing the implicit encoding anyway, at least to
 > some extent, and thus not trip over the PEP?

Sure.  I'm not really arguing the PEP here; the point is that under
the current definition of Python strings, ambiguity is unavoidable.
The best we can ask for is fewer exceptions, and an attempt to reduce
ambiguity to a bare minimum in the code paths that we open up when we
make a definition that allows a formerly erroneous computation to
succeed.
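
A minimal sketch of that ambiguity, assuming Python 3.1 or later, where
the PEP's error handler is available under the name "surrogateescape"
(that name does not appear in this message). A lone surrogate produced
by decoding an undecodable byte compares equal to one that a program
put into the string directly:

```python
# Assumption: Python 3.1+, where PEP 383's handler is "surrogateescape".
raw = b"\xff"  # a byte that is never valid in UTF-8

# Decoding maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
from_bytes = raw.decode("utf-8", "surrogateescape")

# Python does not validate str contents, so the same code point can
# also be created by hand, with no raw byte behind it.
built_by_hand = chr(0xDCFF)

# Nothing in the string records which of the two origins it had.
same = from_bytes == built_by_hand
print(same)
```

Either string encodes back to the byte 0xFF under the same handler, so
the origin cannot be recovered from the string alone.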

Martin is well aware of this, and the PEP is clear enough about that (to
me, but I'm a mail and multilingual editor internals kinda guy<wink>).
I'd rather have more validation of strings, but *shrug* Martin's doing
the work.

OTOH, the Unicode fans need to understand that the past policy of Python
is not to validate; Python is intended to provide all the tools needed
to write validating apps, but it isn't one itself.  Martin's PEP is
quite narrow in that sense.  All it is about is an invertible encoding
of broken encodings.  It does have the downside that it guarantees
that Python itself can produce non-conforming strings, but that's not
the end of the world, and an app can keep track of them or even refuse
them by setting the error handler, if it wants to.
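
As a rough illustration of both points (the round trip, and an app
refusing such strings), here is a sketch that again assumes Python 3.1+
and the "surrogateescape" handler name. The helper is_conforming() is a
hypothetical name chosen for this example, not part of the PEP or the
standard library:

```python
# Assumption: Python 3.1+, where PEP 383's handler is "surrogateescape".
raw = b"caf\xe9"  # Latin-1 bytes; the trailing 0xE9 is invalid as UTF-8

# Invertible: the bad byte becomes U+DCE9 and encodes back unchanged.
text = raw.decode("utf-8", "surrogateescape")
round_tripped = text.encode("utf-8", "surrogateescape")


def is_conforming(s):
    """Hypothetical helper: True if s encodes under the strict handler."""
    try:
        s.encode("utf-8")  # default errors="strict" rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True


print(round_tripped == raw)
print(is_conforming(text), is_conforming("cafe"))
```

An app that wants to track rather than refuse such strings could call a
check like this at its boundaries and record the result.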