[Python-Dev] len(chr(i)) = 2?
Stephen J. Turnbull
turnbull at sk.tsukuba.ac.jp
Tue Nov 23 17:16:55 CET 2010
"Martin v. Löwis" writes:
> I disagree: Quoting from Unicode 5.0, section 5.4:
>
> # The individual components of implementations may have different
> # levels of support for surrogates, as long as those components are
> # assembled and communicate correctly.
"Assembly" is the problem. If chr() or a slice creates a lone
surrogate and surrogateescape passes it back out, Python as a whole is
non-conforming.
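A minimal sketch of that assembly, assuming a current Python 3 (3.3 or later) and picking U+DC80 purely for illustration: chr() hands back a lone surrogate without complaint, the strict UTF-8 codec refuses it, and surrogateescape passes it out as a raw byte that is not well-formed UTF-8.

```python
# Illustrative sketch only; U+DC80 is an arbitrary choice from the range
# U+DC80..U+DCFF that the surrogateescape handler knows how to encode.
lone = chr(0xDC80)          # a lone low surrogate; chr() does not object
assert len(lone) == 1

# The strict UTF-8 codec refuses to emit a lone surrogate.
try:
    lone.encode("utf-8")
    strict_encodes = True
except UnicodeEncodeError:
    strict_encodes = False

# surrogateescape maps U+DC80..U+DCFF back to the raw bytes 0x80..0xFF.
escaped = lone.encode("utf-8", "surrogateescape")

# The bytes that come out are not well-formed UTF-8.
try:
    escaped.decode("utf-8")
    well_formed = True
except UnicodeDecodeError:
    well_formed = False

print(strict_encodes, escaped, well_formed)
```

Each component behaves as documented here; it is only the combination that lets ill-formed output escape.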
Technically, you can hide behind "none of slicing, chr(), or
surrogateescape promises to conform", and maybe that would fly to a
standards lawyer; I'd have to see the precise statement.
Here's a more convincing example. A user specifies "utf8" as her
locale charset. Then she specifies a string containing a non-BMP
character as the "description" of a file, and internal code munges
this via slicing into a file name conforming to some specification
(e.g., length limit + uniquifier if needed). Then if the non-BMP
character is in the "right" place, she will get a broken file
name, which will either get written to disk or raise an exception,
depending on whether the munging program has enabled surrogateescape
or not.
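The munging scenario can be sketched roughly as follows. Current Pythons index strings by code point, so this simulates slicing by UTF-16 code unit explicitly, which is what slicing amounts to on a build where len(chr(i)) can be 2. The helper names, the description string, and the length limit are all hypothetical, and the "safe" variant is just one possible repair, not anything Python itself does.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def units_to_bytes(units):
    """Pack a list of 16-bit code units back into UTF-16-LE bytes."""
    return b"".join(u.to_bytes(2, "little") for u in units)

def naive_truncate(text, limit):
    """Cut after `limit` code units, ignoring surrogate pairs."""
    return utf16_units(text)[:limit]

def safe_truncate(text, limit):
    """Cut after at most `limit` code units without splitting a pair."""
    units = utf16_units(text)[:limit]
    if units and 0xD800 <= units[-1] <= 0xDBFF:
        units.pop()         # drop a trailing high surrogate left by the cut
    return units

# Hypothetical description: U+1D11E (a non-BMP character) sits right at
# the cut point, i.e. in the "right" place.
description = "score\U0001D11Efinal"
LIMIT = 6

naive = naive_truncate(description, LIMIT)
safe = safe_truncate(description, LIMIT)

# The naive cut ends in a lone high surrogate, so it is not valid UTF-16.
try:
    units_to_bytes(naive).decode("utf-16-le")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(hex(naive[-1]), naive_ok, units_to_bytes(safe).decode("utf-16-le"))
```

Whether the broken name then reaches the disk or raises depends on the error handler in force at the encoding step, which is the point of the paragraph above.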
I claim both of those results are non-conforming to the specification
of UTF-16, and therefore Python Unicode processing as a whole must be
considered non-conforming.
It's still pretty damn good. But I've elaborated that point
elsewhere.
> The rationale for supporting these characters in chr() goes back much
> further than the surrogateescape handler - as Python unicode strings
> are sequences of code points, it would be impractical if you couldn't
> create some of them, or even would have to consult the UCD before
> determining whether they can be created.
The Zen is irrelevant to determining conformance to Unicode, which has
its own Zen.