[Python-3000] Unicode and OS strings
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Tue Sep 18 11:12:19 CEST 2007
On Tue, 18-09-2007 at 13:08 +0900, Stephen J. Turnbull wrote:
> > This is wrong: UTF-8 is specified for PUA. PUA is not special from the
> > point of view of UTF-8.
> It is from the point of view of the Unicode standard, specifically v5.
> Please see section 16.5, especially about the "corporate use subarea".
It is not. 16.5 doesn't say anything about UTF-8, and UTF-8 is already
specified for PUA.
> > UTF-8 is defined for all Unicode scalar values,
> Sure, and what I propose is entirely compatible with the specification
> of UTF-8 as a UTF,
It is not. In UTF-8 '\ue650' is b'\xEE\x99\x90', in your proposal it
might be encoded as a single byte.
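
The byte sequence quoted above can be checked directly. This is only a minimal Python 3 sketch; the variable names are mine, not from the thread:

```python
# U+E650 lies in the BMP Private Use Area (U+E000..U+F8FF).
ch = "\ue650"

# Standard UTF-8 encodes any scalar value in this range as three bytes.
encoded = ch.encode("utf-8")
print(encoded)       # b'\xee\x99\x90'
print(len(encoded))  # 3

# The encoding round-trips exactly like any other character.
assert encoded.decode("utf-8") == ch
```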
> > "C10. When a process interprets a code unit sequence which purports to
> > be in a Unicode character encoding form, it shall treat ill-formed code
> > unit sequences as an error condition and shall not interpret such
> > sequences as characters."
> Yeah, that's the one.
> While I'm uncomfortable advocating the position that my proposal is
> entirely compatible with C10,
It is not. Elements of PUA are characters.
> it is arguable that "mapping code units to
> characters in private space" is not the same as "interpreting them as
> characters".
It's not the same, but interpreting as characters in PUA is obviously
interpreting as characters.
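
To illustrate both halves of the argument, here is a small Python 3 sketch. It assumes the standard `unicodedata` module and Python's default strict UTF-8 decoder; the sample byte string is my own choice of an ill-formed sequence:

```python
import unicodedata

# A PUA code point is an ordinary character with general category "Co"
# (private use), so decoding bytes into it is decoding into a character.
category = unicodedata.category("\ue650")
print(category)  # Co

# A truncated three-byte sequence is ill-formed UTF-8. A strict decoder
# reports an error instead of interpreting it as any character.
ill_formed = b"\xee\x99"
try:
    ill_formed.decode("utf-8")  # errors="strict" is the default
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```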
> chibi:MacPorts steve$ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650")
> Traceback (most recent call last):
> File "<string>", line 1, in ?
> TypeError: ord() expected a character, but string of length 6 found
I meant Python3 where sys.argv is a list of Unicode strings. It should
work out of the box.
Why length 6? "\ue650" encoded in UTF-8 has length 3.
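
One possible explanation, which is my assumption and not confirmed in the thread, is that the shell's printf left the \u escape unexpanded, so the argument was the six ASCII characters of the escape itself. A Python 3 sketch of the lengths involved:

```python
# Assumption: the shell passed the escape through literally.
literal = r"\ue650"   # backslash, u, e, 6, 5, 0
decoded = "\ue650"    # the single character U+E650

print(len(literal))                  # 6
print(len(decoded))                  # 1
print(len(decoded.encode("utf-8")))  # 3
```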
For an old discussion about using PUA to represent bytes undecodable
as UTF-8, see http://email@example.com/ and
subthreads with "roundtripping" in the subject.
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl