[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Sat Nov 20 05:11:48 CET 2010


"Martin v. Löwis" writes:

 > The term "UCS-2" is a character set that can encode only 65536
 > characters; it thus refers to Unicode 1.1. According to the Unicode
 > Consortium's FAQ, the term UCS-2 should be avoided these days.

So what do you propose we call the Python implementation?  You can
call it "code-unit-oriented" if you like, but in fact it is identical
to UCS-2 for all non-hairsplitting purposes.  AFAICS the Unicode
Consortium deprecates the *term* UCS-2 because they would like us to
avoid *implementations* that don't encode the full Unicode character
set, not because the term is technically incorrect.

Strictly speaking, internally Python only encodes 65536 characters in
2-octet builds.  Its (Unicode) string-handling code does not know
about surrogates at all, AFAIK, and therefore is not UTF-16
conforming.  (The anomalies discussed here are type transformations,
not string-handling, for my purpose.)  I really don't see why we
shouldn't call a UCS-2 implementation by its name.
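
To make the code-unit point concrete, here is a rough sketch (mine, not
anything in CPython; the helper names are made up for illustration).
On a 2-octet build len(chr(0x10000)) reports 2.  Interpreters from 3.3
on (PEP 393) no longer have narrow builds, so the sketch counts UTF-16
code units explicitly instead of relying on len():

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Split a supplementary code point (>= U+10000) into two surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp = "\u00e9"         # inside the Basic Multilingual Plane
astral = "\U00010000"  # first code point outside it

print(utf16_code_units(bmp))     # a single code unit
print(utf16_code_units(astral))  # two code units, i.e. a surrogate pair
print([hex(u) for u in surrogate_pair(0x10000)])
```

A code-unit-oriented string type indexes and slices over those two
units separately, which is the UCS-2-like behavior I mean.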

AFAIK this was not supposed to change in Python 3; indexing and
slicing go by code unit (isomorphic to UCS-n), not character, and due
to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
can produce output that conforms to Unicode not at all (as a user
option, of course, but it's still non-conformant).
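
A small sketch of the PEP 383 behavior, assuming a Python 3 interpreter
with the "surrogateescape" error handler (the variable names are only
illustrative).  An undecodable byte becomes a lone surrogate inside the
str, and the same handler on output writes bytes that are not
well-formed UTF-8:

```python
raw = b"abc\xff"  # 0xFF can never appear in well-formed UTF-8

# PEP 383: the undecodable byte 0xFF is smuggled in as lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# A strict encoder refuses the lone surrogate ...
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... but the same handler on output reproduces the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")
print(strict_ok, round_trip == raw)
```

The str in the middle holds a code point that no conforming UTF-32
string may contain, regardless of build width.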

 > > IMO, we should go back to the Python2 terms UCS2 and UCS4 which
 > > are correct and provide a clear description of what Python uses
 > > internally for code units.
 > 
 > No, we shouldn't. The term UCS-2 is deprecated, see above.

Too bad for the Unicode Consortium, I say.  UCS-2 is the closest term
that folks who are not Unicode geeks will have a chance of
understanding.

I agree with Marc-Andre that "narrow" and "wide" are too ambiguous to
be useful.  Many people will interpret that as "UTF-16" (or even
"UTF-8") and "UTF-32", respectively, which is dead wrong.  Others
won't have a clue.  Using "UCS-2" and "UCS-4" has the correct
connotations to Unicode geeks, and they are easy to look up for
non-geeks who care about precise definitions.  Cf. the second half of
the FAQ you quote:

    Instead, "UCS-2" has sometimes been used in the past to indicate
    that an implementation does not support supplementary characters
    and doesn't interpret pairs of surrogate code points as
    characters. Such an implementation would not handle processing
    like character properties, codepoint boundaries, collation,
    etc. for supplementary characters.

"Hey, Python, I'm looking at you!"  (Strictly speaking, Python
libraries do some of that for us, but the Python *language* does not.)
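
As a hedged illustration of "libraries do some of that", the stdlib
unicodedata module looks up properties per code point.  The sketch
below assumes an interpreter where a supplementary character is a
single str element (3.3 or later).  It contrasts the property of the
whole character with what a code-unit view sees, namely two surrogates:

```python
import unicodedata

clef = "\U0001D11E"            # a supplementary character
halves = ["\ud834", "\udd1e"]  # its UTF-16 surrogate pair, as separate units

whole_category = unicodedata.category(clef)
half_categories = [unicodedata.category(h) for h in halves]

# The whole character has a name and a real category; each half is just "Cs".
print(unicodedata.name(clef), whole_category, half_categories)
```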


