[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Sat Sep 15 02:13:31 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

 >> And it *is* needed, because these characters by assumption
 >> are not present in Unicode at all.  (More precisely, they may be
 >> present, but the tables we happen to have don't have mappings for
 >> them.)

 > They are present! For UTF-8, UTF-16 and UTF-32 PUA is not special in
 > any way.

The characters I am referring to are the unstandardized so-called
"corporate characters" that are very common in Japanese text.  My
solution handles your problem, slightly less efficiently than yours
does, perhaps, but in a Unicode-conforming way.  Yours doesn't help
with mine at all.

 > It preserves the byte string contents, which is all that is needed.

That is not true in any environment where the encoding is not known
with certainty.

 > It has the same result as UTF-8 for all valid UTF-8 sequences not
 > containing NUL.

Sorry, I'm talking about real Japanese and other situations where
there is no corresponding Unicode code point, and about a solution
which handles not only that but also corrupt UTF-8.  Valid UTF-8 is
not a problem; it's the solution.  But a robust language should
handle text that is not valid UTF-8 in a way that allows the
programmer or user to implement error correction at a finer-grained
level than dumping core.
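
To make this concrete, here is a rough Python sketch of the kind of
finer-grained handling I mean, using the codec error-handler
machinery.  The handler name, the choice of U+E000 as the PUA base,
and the sample data are mine, picked purely for illustration, and the
sketch ignores collisions with text that legitimately uses the PUA:

    import codecs

    PUA_BASE = 0xE000  # start of the BMP Private Use Area

    def pua_fallback(exc):
        # Instead of aborting, map each undecodable byte to a PUA
        # code point so the caller can inspect or repair it later.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
        raise exc

    codecs.register_error('pua-fallback', pua_fallback)

    # Corrupt UTF-8 no longer stops the program; the bad bytes end up
    # as U+E0xx code points that can be turned back into the original
    # bytes later (see the encoding sketch further down).
    data = b'caf\xe9 \xff'                       # not valid UTF-8
    text = data.decode('utf-8', 'pua-fallback')  # 'caf\ue0e9 \ue0ff'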

 >> I'm also very bothered by the fact that the interpretation of U+0000
 >> differs in different contexts in your proposal.

 > Well, when any scheme that attempts to modify UTF-8 by accepting
 > arbitrary byte strings is used, *something* must be interpreted
 > differently than in real UTF-8.

Wrong.  In my scheme everything ends up in the PUA, on which real
UTF-8 imposes no interpretation by definition.

I haven't gone back to check yet, but it's possible that a "real UTF-8
conforming process" is required to stop processing and issue an error
or something like that in the cases we're trying to handle.  But your
extension and James Knight's extension both fall afoul of any such
provision, too.

 >> Once you get a string into Python, you normally no longer know
 >> where it came from, but now whether something came from the
 >> program argument or environment or from a stdio stream changes the
 >> semantics of U+0000.  For me personally, that's a very good reason
 >> to object to your proposal.

 > This can be said about any modification of UTF-8.

It's not true of James Knight's proposal, because the same
modification can be used for both program arguments and file streams.

And my proposal doesn't modify UTF-8 at all; it takes advantage of the
farsighted wisdom of the designers of Unicode and puts all the
non-standard "characters", including bytes from broken encodings, in
the PUA.

 > Of course you can use such encoding on a standard stream too. In
 > this case only U+0000 cannot be used normally, and the resulting
 > stream will contain whatever bytes were present in filenames and
 > other strings being output to it.

A programmer can use it, but his users will curse his name every time
a binary stream gets corrupted because they forgot that little detail.

 >>  > Of course my escaping scheme can preserve \0 too, by escaping it to
 >>  > U+0000 U+0000, but here it's incompatible with the real UTF-8.

 >> No.  It's *never* compatible with UTF-8 because it assigns a different
 >> meaning to U+0000 from ASCII NUL.

 > It is compatible with UTF-8 except for U+0000, and a true U+0000 cannot
 > occur anyway in these contexts, so this incompatibility is mostly
 > harmless.

Forcing users to use codecs with subtly different semantics simply
because they're getting I/O from different sources is a substantial
harm.

 >> Your scheme also suffers from the practical problem that strings
 >> containing escapes are no longer arrays of characters.

 > They are no less arrays of characters than strings containing combining
 > marks.

Those marks are characters in their own right.  Your escapes are not,
nor are surrogates.

It's true that users will be surprised by the count of characters in
many cases with unnormalized Unicode, but such cases can be reduced
to very few by normalizing to NFC.
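
For example (plain standard library, nothing specific to either
proposal):

    import unicodedata

    decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT
    composed = unicodedata.normalize('NFC', decomposed)

    len(decomposed)   # 2 -- the surprising count
    len(composed)     # 1 -- U+00E9, LATIN SMALL LETTER E WITH ACUTE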


