[Python-Dev] len(chr(i)) = 2?

Mon Nov 22 11:48:42 CET 2010

Raymond Hettinger writes:

 > Neither UTF-16 nor UCS-2 is exactly correct anyway.

From a standards lawyer's point of view, UCS-2 is exactly correct, as
far as I can tell upon rereading ISO 10646-1, especially Annexes H
("retransmitting devices") and Q ("UTF-16").  Annex Q makes it clear
that UTF-16 was intentionally designed so that Python-style processing
could be done in a UCS-2 context.
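
To make the code-unit view concrete, here is a minimal sketch of my own
(not taken from the standard).  It uses the "utf-16-le" codec to list the
16-bit code units of a string, so it does not depend on how the
interpreter was built.  The helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")
    # Every UTF-16 code unit is exactly two bytes, little-endian here.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

bmp_char = "\u00e9"         # U+00E9, inside the BMP
astral_char = "\U0001D11E"  # U+1D11E, outside the BMP

bmp_units = utf16_code_units(bmp_char)
astral_units = utf16_code_units(astral_char)

print(len(bmp_units))                   # 1
print([hex(u) for u in astral_units])   # ['0xd834', '0xdd1e']
```

A process that treats each 16-bit unit as one UCS-2 character sees two
units for the non-BMP character, which is the len(chr(i)) == 2 behavior
in the subject line.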

 > For the "wide" build, the entire range of unicode is encoded at
 > 4 bytes per character and slicing/len operate correctly since
 > every character is the same length.   This used to be called UCS-4
 > and is now UTF-32.

That's inaccurate, I believe.  UCS-4 is not a UTF, and doesn't satisfy
the range restrictions of a UTF.
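
As a rough illustration of the range restriction I mean, the sketch below
checks whether a value is a Unicode scalar value, which is what a UTF may
encode.  The function name is hypothetical, and the 31-bit upper bound
used for UCS-4 is my reading of the original ISO 10646 definition.

```python
UCS4_MAX = 0x7FFFFFFF  # assumption: UCS-4 as a 31-bit code space

def is_utf32_scalar(code_point):
    """True if code_point is something a UTF is allowed to encode."""
    in_unicode_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_unicode_range and not is_surrogate

print(is_utf32_scalar(0x1D11E))    # True: ordinary non-BMP character
print(is_utf32_scalar(0xD800))     # False: surrogate code point
print(is_utf32_scalar(0x110000))   # False: beyond the Unicode range
print(0x110000 <= UCS4_MAX)        # True: still inside the UCS-4 space
```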

 > So, with "wide" builds there isn't much confusion (except perhaps
 > unfamiliar terminology).   The real issue seems to be that for 
 > "narrow" builds, none of the usual encoding names is exactly
 > correct.  

I disagree.  I do see a problem with "UCS-2", because it fails to tell
us that Python implements a large number of features that make it easy
to do a very good job of working with non-BMP data in 16-bit builds of
Python, with no extra effort.  Python is not perfect, and (rarely)
some of the imperfections may be very distressing.  But it's very
good, and deserves to be advertised as such.

However, I don't see how "narrow" tells us more than "UCS-2" does.  If
"UCS-2" is equally (or more) informative, I prefer it because it is
the technically precise, already well-defined, term.

 > From a users point-of-view, the actual encoding or encoding name 
 > doesn't matter much.  They just need to be able to predict the relevant
 > behaviors (memory consumption and len/slicing behavior).

"UCS-2" indicates those behaviors precisely and concisely.  The
problems are (a) users' unfamiliarity with this term, if
David is reasonably representative, and (b) the fact that it fails to
advertise Python's UTF-16 capabilities.  "Narrow" suffers from both of
those problems, and further from the fact that it has no independent
standard definition.  Furthermore, "wide" has a very widespread,
platform-dependent meaning derived from wchar_t.
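
The platform dependence is easy to check.  This small sketch asks ctypes
for the size of the C wchar_t type on whatever machine runs it; I would
expect 2 bytes on Windows and 4 on most Unix-like systems, but that is an
expectation, not a guarantee.

```python
import ctypes

# ctypes.c_wchar corresponds to the platform's C wchar_t type.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

print(f"wchar_t is {wchar_bytes} bytes ({wchar_bytes * 8} bits) here")
```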

If we have to document what the terms we choose mean anyway, why not
document the existing terms and reduce entropy, rather than invent new
ones and increase entropy?