[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Tue Nov 23 16:00:22 CET 2010


If you don't care about the ISO standard, but only about Python,
Martin's right, I was wrong.  You can stop reading now.<wink>

"Martin v. Löwis" writes:

 > I could only find the FCD of 10646:2010, where annex H was integrated
 > into section 10:

Thank you for the reference.

I referred to two older versions, 10646-1:1993 (for the annexes and
Amendment, and my basic understanding) and 10646:2003 (for the
detailed definition of UCS-2 in Sections 7, 8 and 13; unfortunately, I
missed the most important detail, which is in Section 9).  In :2003
the Annex I referred to as "Annex H" is Annex J, and "Annex Q" is
partly in Section 9.1 and mostly in Annex C.  I don't know where the
former ended up in the 2010 FCD; the latter is now section 9.2.

 > I think they are now acknowledging that UCS-2 was a misleading term,
 > making it ambiguous whether this refers to a CCS, a CEF, or a CES;
 > like "ASCII", people have been using it for all three of them.

In :1993 it wasn't ambiguous, they simply didn't make those
distinctions.  They were not needed for ISO 10646's published
versions, although they certainly are for Unicode.

Now, quite clearly, the ISO has *changed the definition* in every new
version, progressively adding new restrictions that go beyond
clarifying ambiguity.  But even in :2003, in view of 4.2, 6.2, 6.3,
and 13.1, UCS-2 is clearly well-defined as a CM (character map)
according to UTR#17, which can probably be identified with a CCS in
:2003 terminology.  I.e.,
returning to UTR#17 terminology, it is the composition of a CES, a
CEF, and a CCS, which are not defined individually.  Note: The
definition of "coded character" changed between :2003 and the 2010
FCD, from "character with representation" to "character with integer".

There is a NOTE indicating that 16-bit integers may be used in
processing.  Given that this is a non-normative note, I take it to
mean that in an array of 16-bit integers, "most significant octet" is
to be interpreted in the natural way for the architecture rather than
by the representation in memory, which might be little-endian.  IMO
it's unnatural to think that that changes the definition of UCS-2 to
be either a CEF, or a composition of a CEF and a CCS.
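To make the distinction concrete (this example is mine, not from the
standard): an array of 16-bit integers takes whatever octet order the
host architecture uses, while the UCS-2 octet serialization is defined
most-significant-octet first.  A sketch using Python's array and struct
modules:

```python
import array
import struct

# Three BMP characters held as an array of 16-bit integers,
# as the non-normative NOTE permits for processing.
units = array.array('H', [ord('A'), ord('\u00e9'), ord('\u4e2d')])

# In memory, the octet order of each unit follows the host
# architecture, which may well be little-endian:
native = units.tobytes()

# The octet serialization, by contrast, puts the most significant
# octet first regardless of architecture ('>' forces big-endian):
serialized = struct.pack('>%dH' % len(units), *units)
assert serialized == b'\x00\x41\x00\xe9\x4e\x2d'
```

On a little-endian machine `native` and `serialized` differ, which is
exactly why reading "most significant octet" as a statement about the
in-memory layout would be unnatural for the processing case.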

 > Apparently, the ISO WG interprets earlier revisions as saying that
 > UCS-2 is a CEF that restricted UTF-16 to the BMP.

I think that ISO 10646-1:1993 admits only one interpretation, a CM
restricted to the BMP (including surrogates), and ISO 10646:2003
admits only one interpretation, a CM restricted to the BMP (not
including surrogates).  The note under Table 4 on p.24 of the FCD is,
uh, well, a lie.  Earlier versions certainly did not restrict to
"scalar values"; they had no such concept.

 > THIS IS NOT WHAT PYTHON DOES.

Well, no shit, Sherlock.  You don't have to yell at me, I know what
Python does.  The question is, what does UCS-2 do?  The answer is
that in :1993, AFAICT it did what Python does.  In :2003, they added
(last sentence, section 9.1):

    UCS-2 cannot be used to represent any characters on the
    supplementary planes.

I assume they maintain that position in 2010, so End Of Thread.
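For concreteness, this is the arithmetic a UTF-16 representation (such
as a narrow Python build) uses to put a supplementary-plane character
into 16-bit code units; it is exactly what the :2003 definition of
UCS-2 rules out.  The function name is my own, but the formulas are the
standard surrogate-pair construction:

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF, "only supplementary planes need a pair"
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

# U+1D11E MUSICAL SYMBOL G CLEF becomes two 16-bit code units:
high, low = to_surrogate_pair(0x1D11E)
assert (high, low) == (0xD834, 0xDD1E)
```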

I apologize for missing that when I was reviewing the standard
earlier, but I expected restrictions on UCS-2 to be explained in 13.1
or perhaps 14.  And 13.1 simply requires that characters in the BMP be
represented by their defined code positions, truncated to two octets.
Like earlier versions, it doesn't prohibit use of surrogates or say
that non-BMP characters can't be represented.
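A sketch of the 13.1 rule as I read it (the helper name is mine): the
truncation requirement itself constrains only how BMP code positions
map to 16-bit units, and says nothing that would reject surrogate code
positions.

```python
def ucs2_unit(ucs4_code):
    # 13.1: a BMP character's UCS-4 code position, truncated to two octets.
    if ucs4_code > 0xFFFF:
        raise ValueError("no single UCS-2 unit for a non-BMP code position")
    return ucs4_code & 0xFFFF

assert ucs2_unit(0x00004E2D) == 0x4E2D
# Nothing in the truncation rule itself rejects surrogate code positions:
assert ucs2_unit(0x0000D834) == 0xD834
```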

 > Not sure what it says in your copy; in mine, section 9.3 says

[snip]

Mine (:2003) says "NOTE 2 - When confined to the code positions in
Planes 00 to 10, UCS-4 is also referred to as UCS Transformation
Format 32 (UTF-32)."  Then it references the Unicode Standard (v4.0)
as the authority for UTF-32.  Obviously they continued to be confused
at this point in time; by the draft you have, apparently the WG had
decided to pretty much completely synchronize the whole standard to a
subset of Unicode.  This seems pointless to me (unlike, say, the work
that has been done on standardizing criteria for repertoire changes).

In particular, the :1993 definition of UCS-2 was a perfectly good
standard for describing the processing Python actually does
internally.  The current definition of UCS-2 as identical to the BMP
is useless, and good riddance, I'm perfectly happy to have them
deprecate it.


