[Python-3000] Unicode and OS strings

Thu Sep 13 20:43:59 CEST 2007

"Martin v. Löwis" writes:

 > One "universal" solution is to use Unicode private-use-area
 > characters. 

+1

 > Of course, if the input data already contains PUA characters,
 > there would be an ambiguity.

That may be true in the implementation, but it shouldn't be.  What should
happen internally is that all undecodable characters (which PUA
characters are by definition for standard codecs) are mapped to unused
codepoints in the PUA, chosen by Python.

This map would be required to maintain some housekeeping information
about where each character came from (specifically, the original
coded character set) so that round-tripping would succeed.
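
A minimal sketch of such a map, in Python.  The name PuaMap and its
methods are hypothetical, not an existing API, and the sketch only
draws on the BMP private-use range U+E000..U+F8FF.  Each distinct
(charset, raw bytes) pair gets its own code point, and the reverse
table is the housekeeping information needed for round-tripping.

```python
class PuaMap:
    """Hypothetical table assigning undecodable input to PUA code points."""

    PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP private-use area

    def __init__(self):
        self._next = self.PUA_START
        self._forward = {}  # (charset, raw bytes) -> PUA character
        self._origin = {}   # PUA character -> (charset, raw bytes)

    def assign(self, charset, raw):
        """Return the PUA character for this input, allocating if needed."""
        key = (charset, bytes(raw))
        if key not in self._forward:
            if self._next > self.PUA_END:
                raise OverflowError("private-use area exhausted")
            ch = chr(self._next)
            self._next += 1
            self._forward[key] = ch
            self._origin[ch] = key
        return self._forward[key]

    def origin(self, ch):
        """Return the (charset, raw bytes) a PUA character was assigned for."""
        return self._origin[ch]
```

Under this scheme a PUA character that arrives in the input would be
passed through assign() like any other undecodable character, so it
cannot collide with a code point Python has already handed out.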

One possible error-recovery strategy for broken encodings (as opposed
to coding which is correct in format but contains a code point not in
the table) would be to have a "pure code unit" block in the PUA, with
one code point reserved for each possible code unit value.

Note that since we're talking about code units throughout (there's no
guarantee that the encoding in question is octet-oriented, although
that's almost always the case in practice), 256 code points may not be
enough.
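
To put rough numbers on that, here is a small sketch that compares
the size of a code unit block with the three private-use ranges
Unicode defines.  The function name fits() is made up for
illustration.

```python
# The three private-use ranges defined by Unicode (inclusive bounds).
PUA_RANGES = {
    "BMP private use area": (0xE000, 0xF8FF),
    "Supplementary PUA-A (plane 15)": (0xF0000, 0xFFFFD),
    "Supplementary PUA-B (plane 16)": (0x100000, 0x10FFFD),
}

def fits(code_unit_bits):
    """For each range, report whether it can hold one code point per
    possible code unit value of the given width."""
    needed = 1 << code_unit_bits
    return {name: (hi - lo + 1) >= needed
            for name, (lo, hi) in PUA_RANGES.items()}
```

An octet-oriented block (256 values) fits in any of the ranges.  A
block for 16-bit code units needs 65536 code points, which is more
than the 6400 in the BMP range and just over the 65534 in each of the
supplementary ranges.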

 > We would make a list of all interfaces that use the PUA error
 > handler: file names, environment variables, command line
 > arguments.

In general, I don't consider this an error.  It's reasonable to use
exception handling internally to the codec -- such broken texts are
rare except in interactive applications where speed isn't an issue
-- but for some applications it would be useful to accept entire
broken strings and pass them to Python with the broken parts marked
(i.e., by being assigned to the "code unit" block of the PUA) and the
rest decoded.

Here's an example that comes up in Emacs (specifically AUCTeX).  TeX
error messages are octet-oriented and regularly slice multibyte
encodings in the middle of characters or escape sequences.  It turns
out the basic codec algorithms often DTRT by (accidentally)
resynchronizing on ASCII, and sometimes can even resynch on a
multibyte character.  So the display of the "broken" text is often
useful.  However, for reasons I'm not familiar with, the AUCTeX
developers have asked that the strings be invertible (i.e., back to the
octets that TeX spit out).  This scheme would allow that.
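
A rough sketch of how that could look for an octet-oriented encoding,
using a codec error handler.  Everything named here is an assumption
made for illustration: the handler name "pua-codeunit", the block
base U+F700, and the helpers decode_marked() and encode_marked().
The sketch also assumes a stateless encoding such as UTF-8, and it
does not deal with input that already contains characters in the
chosen block, which would need the remapping described earlier.

```python
import codecs

BLOCK = 0xF700  # hypothetical base of a 256-entry "code unit" block

def pua_codeunit(exc):
    """Map each undecodable octet to BLOCK + octet and resume decoding."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(BLOCK + b) for b in bad), exc.end
    raise exc

codecs.register_error("pua-codeunit", pua_codeunit)

def decode_marked(data, encoding="utf-8"):
    """Decode what can be decoded; mark the broken octets in the block."""
    return data.decode(encoding, errors="pua-codeunit")

def encode_marked(text, encoding="utf-8"):
    """Invert decode_marked(): marked characters become raw octets again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BLOCK <= cp < BLOCK + 256:
            out.append(cp - BLOCK)       # restore the original octet
        else:
            out += ch.encode(encoding)   # ordinary character
    return bytes(out)

# A message sliced in the middle of a two-octet UTF-8 character.
sliced = "na\u00efve".encode("utf-8")[:3]   # b'na\xc3'
text = decode_marked(sliced)                # 'na\uf7c3'
assert encode_marked(text) == sliced
```

The readable part of the message survives decoding, and the stray
octet can still be recovered exactly when the string is handed back.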