[Python-3000] Unicode and OS strings

Fri Sep 14 09:49:33 CEST 2007

On Fri, 14-09-2007 at 15:02 +0900, Stephen J. Turnbull wrote:

>  > PUA already has a representation in UTF-8, so this is more incompatible
>  > with UTF-8 than needed,
> 
> Hm?  It's not incompatible at all, and we're not interested in a
> representation in UTF-8, but rather in UTF-16

PUA is representable in both. When the command line contains a UTF-8
encoding of U+E650 (a PUA character), the script had better receive
a UTF-16 or UTF-32 encoding of U+E650 in the appropriate place;
otherwise we are corrupting user data.
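
To make this concrete, here is a small Python 3 sketch (an illustration
only, not part of either proposal) that shows the same code point U+E650
in all three encoding forms. The variable names are made up for the
example.

```python
# U+E650 is a Private Use Area code point; the codecs treat it like any
# other BMP character.
pua = "\ue650"

utf8 = pua.encode("utf-8")        # three bytes, as for any U+0800..U+FFFF
utf16 = pua.encode("utf-16-be")   # a single 16-bit code unit
utf32 = pua.encode("utf-32-be")   # a single 32-bit code unit

print(utf8.hex(), utf16.hex(), utf32.hex())   # ee9990 e650 0000e650

# Decoding the UTF-8 bytes gives back exactly the same code point.
assert utf8.decode("utf-8") == pua
```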

> (ie, the Python internal encoding).

(Python can also use UTF-32 instead of UTF-16, depending on the build.)

> And it *is* needed, because these characters by assumption
> are not present in Unicode at all.  (More precisely, they may be
> present, but the tables we happen to have don't have mappings for
> them.)

They are present! For UTF-8, UTF-16 and UTF-32 PUA is not special in
any way. It's just a block of characters which will never be officially
assigned by the Unicode Consortium, so they can be used privately among
parties who agree about their meaning.
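
A quick way to check this, sketched in Python 3 with the standard
unicodedata module: every code point in the BMP private use block
U+E000..U+F8FF is a known character of general category "Co", and each
one round-trips through the three codecs unchanged. The names pua_block,
all_private and round_trips are only illustrative.

```python
import unicodedata

# The BMP Private Use Area is U+E000..U+F8FF.
pua_block = [chr(cp) for cp in range(0xE000, 0xF8FF + 1)]

# Every code point in the block has general category "Co" (private use).
all_private = all(unicodedata.category(ch) == "Co" for ch in pua_block)

# Each one round-trips unchanged through UTF-8, UTF-16 and UTF-32.
round_trips = all(
    ch.encode(enc).decode(enc) == ch
    for ch in pua_block
    for enc in ("utf-8", "utf-16", "utf-32")
)

print(all_private, round_trips)   # True True
```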

> Your escaping proposal *guarantees* mangling because it turns
> characters into tuples of code units; it does not preserve character
> set information.

Huh? What do you mean by preserving character set information?

It preserves the byte string contents, which is all that is needed.
It has the same result as UTF-8 for all valid UTF-8 sequences not
containing NUL.

>  > While U+0000 is also representable in UTF-8, it cannot occur in
>  > filenames, program arguments, environment variables etc., in many
>  > contexts it was free.
> 
> In your experience, and mine, but is it in POSIX?

Yes, both as specified and in practice (e.g. POSIX offers the second
parameter of main(), of type char **, as the only way to receive command
line arguments, and they are NUL-terminated).
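
The same limitation is visible from Python 3 itself. The sketch below
assumes CPython's os module, which (as far as I know) refuses strings
with an embedded NUL by raising ValueError before any system call is
made. The helper rejects_nul and the variable name DEMO_VAR are
hypothetical, and this only illustrates Python's wrappers, not the text
of POSIX.

```python
import os

def rejects_nul(func, *args) -> bool:
    """Return True if func(*args) refuses its arguments with ValueError."""
    try:
        func(*args)
    except ValueError:
        return True       # e.g. "embedded null byte"
    except OSError:
        return False      # the call reached the OS and failed there
    return False

# A file name and an environment value containing U+0000 are both refused.
print(rejects_nul(os.stat, "foo\x00bar"))
print(rejects_nul(os.putenv, "DEMO_VAR", "a\x00b"))
```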

> I'm also very bothered by the fact that the interpretation of U+0000
> differs in different contexts in your proposal.

Well, in any scheme which modifies UTF-8 so that it accepts arbitrary
byte strings, *something* must be interpreted differently than in real
UTF-8.

> Once you get a
> string into Python, you normally no longer know where it came from,
> but now whether something came from the program argument or
> environment or from a stdio stream changes the semantics of U+0000.
> For me personally, that's a very good reason to object to your
> proposal.

This can be said about any modification of UTF-8.

Of course you can use such an encoding on a standard stream too. In this
case only U+0000 cannot be used normally, and the resulting stream will
contain whatever bytes were present in filenames and other strings being
output to it.

>  > Of course my escaping scheme can preserve \0 too, by escaping it to
>  > U+0000 U+0000, but here it's incompatible with the real UTF-8.
> 
> No.  It's *never* compatible with UTF-8 because it assigns a different
> meaning to U+0000 from ASCII NUL.

It is compatible with UTF-8 except for U+0000, and a true U+0000 cannot
occur anyway in these contexts, so this incompatibility is mostly
harmless.
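
A minimal sketch of such a scheme in Python 3 may help. The exact escape
form is an assumption made for this example: a byte which is not part of
a valid UTF-8 sequence becomes U+0000 followed by the character whose
code point equals the byte value, and a NUL byte becomes U+0000 U+0000.
The function names decode_escaped and encode_escaped are hypothetical.

```python
def decode_escaped(data: bytes) -> str:
    """Decode UTF-8, escaping NUL and undecodable bytes with U+0000."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            valid, bad = data[pos:].decode("utf-8"), None
            pos = len(data)
        except UnicodeDecodeError as err:
            # Keep the valid prefix, escape one offending byte, go on.
            valid = data[pos:pos + err.start].decode("utf-8")
            bad = data[pos + err.start]
            pos += err.start + 1
        out.append(valid.replace("\x00", "\x00\x00"))
        if bad is not None:
            out.append("\x00" + chr(bad))
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped: restore the original byte string."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\x00":
            out += text[i].encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text) or ord(text[i + 1]) > 0xFF:
            raise ValueError("dangling or malformed U+0000 escape")
        out.append(ord(text[i + 1]))   # U+0000 U+0000 restores a NUL byte
        i += 2
    return bytes(out)


raw = b"caf\xe9.txt"                   # a Latin-1 filename, invalid UTF-8
print(repr(decode_escaped(raw)))       # 'caf\x00é.txt'
assert encode_escaped(decode_escaped(raw)) == raw
```

For valid UTF-8 input without NUL, decode_escaped returns the same
string as the plain UTF-8 decoder, and encode_escaped always gives the
original bytes back.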

> Your scheme also suffers from the practical problem that strings
> containing escapes are no longer arrays of characters.

They are no less arrays of characters than strings containing combining
marks.
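
For comparison, a short Python 3 sketch with the standard unicodedata
module: a letter followed by a combining mark is one character for the
user but two elements of the string, so indexing already does not
correspond to user-perceived characters.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)   # single U+00E9

print(len(decomposed), len(composed))    # 2 1
print(unicodedata.combining("\u0301"))   # 230, a nonzero combining class
```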

[And now I'm gone for 4 days.]

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/