[Python-3000] Unicode and OS strings
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Mon Sep 17 21:12:00 CEST 2007
On Sat, 15-09-2007 at 09:13 +0900, Stephen J. Turnbull
wrote:
> > Well, if any scheme which attempts to modify UTF-8 by accepting
> > arbitrary byte strings is used, *something* must be interpreted
> > differently than in real UTF-8.
>
> Wrong. In my scheme everything ends up in the PUA, on which real
> UTF-8 imposes no interpretation by definition.
This is wrong: UTF-8 is specified for the PUA. The PUA is not special
from the point of view of UTF-8. UTF-8 is defined for all Unicode scalar
values, i.e. all code points in the ranges U+0000..U+D7FF and
U+E000..U+10FFFF, i.e. all code points excluding surrogates. This
includes the PUA.
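The ranges above can be checked with a short sketch. This is only an
illustration using the UTF-8 codec as it behaves in present-day CPython 3,
not anything proposed in this thread; the helper names is_scalar_value and
utf8_encodable are made up for the example.

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def utf8_encodable(cp: int) -> bool:
    """True if the strict UTF-8 codec accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# U+E000 is the first code point of the BMP Private Use Area.
# It is an ordinary scalar value and has an ordinary 3-byte encoding.
pua_bytes = "\ue000".encode("utf-8")
pua_roundtrip = pua_bytes.decode("utf-8")

# Surrogates (U+D800..U+DFFF) are the only code points left out.
boundary_checks = {
    cp: (is_scalar_value(cp), utf8_encodable(cp))
    for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xF8FF, 0x10FFFF)
}
```

In this sketch the two predicates agree on every boundary value: the PUA
code points are encodable like any other character, and only the
surrogates are rejected.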
> I haven't gone back to check yet, but it's possible that a "real UTF-8
> conforming process" is required to stop processing and issue an error
> or something like that in the cases we're trying to handle.
"C10. When a process interprets a code unit sequence which purports to
be in a Unicode character encoding form, it shall treat ill-formed code
unit sequences as an error condition and shall not interpret such
sequences as characters."
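As a rough illustration of what C10 asks for, the strict UTF-8 decoder in
present-day CPython 3 reports ill-formed input as an error instead of
turning it into characters. The helper decode_or_none and the sample byte
strings are assumptions chosen for this sketch, not part of the standard
or of any proposal in this thread.

```python
from typing import Optional


def decode_or_none(data: bytes) -> Optional[str]:
    """Decode strictly; return None when the input is ill-formed UTF-8."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# Byte sequences that are not well-formed UTF-8.
ill_formed = [
    b"\xff",          # byte that never occurs in UTF-8
    b"\xc0\x80",      # overlong encoding of U+0000
    b"\xed\xa0\x80",  # would decode to the surrogate U+D800
    b"\xe2\x82",      # truncated 3-byte sequence
]

results = [decode_or_none(seq) for seq in ill_formed]

# A well-formed sequence, including a PUA character, decodes normally.
well_formed = decode_or_none(b"abc\xee\x80\x80")
```

Every entry in results is None here, while well_formed is the string
"abc" followed by U+E000.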
--
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/