[Python-Dev] len(chr(i)) = 2?

Raymond Hettinger raymond.hettinger at gmail.com
Sun Nov 21 19:17:57 CET 2010


On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
> 
> I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
> "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
> have heard about them on this list have only confused me. 

From the user's point of view, it doesn't much matter which encoding is
used internally.

Neither UTF-16 nor UCS-2 is exactly correct anyway.  The former encodes
the entire range of Unicode characters in a variable-length code
(a character is usually 2 bytes but is sometimes 4 bytes long).  The latter
encodes only a subset of Unicode (the Basic Multilingual Plane) in a
fixed-length code of 2 bytes per character.
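
To make the size difference concrete, here is a minimal sketch using the
standard "utf-16-le" codec (chosen only so that no byte-order mark is added).
The two sample characters are my own picks for illustration:

```python
# Encoded size of one BMP character and one non-BMP character in UTF-16.
bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, above the BMP

# "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
bmp_bytes = bmp_char.encode("utf-16-le")
astral_bytes = astral_char.encode("utf-16-le")

print(len(bmp_bytes))     # 2
print(len(astral_bytes))  # 4
```

UCS-2 has no way to represent the second character at all, which is why
neither name fits the narrow build exactly.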

What we use internally looks like UTF-16, but a character encoded with
4 bytes is treated as two 2-byte characters (hence the subject of this
thread).  Our hybrid internal coding lets us handle the entire
range of Unicode while getting speed and simplicity by doing len()
and slicing with a surrogate pair treated as two separate
characters.
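
The surrogate-pair arithmetic can be sketched as below.  The helper names
surrogate_pair and narrow_len are hypothetical, made up for this example.
narrow_len only simulates what a narrow build reports, by counting 16-bit
code units in the UTF-16 encoding; interpreters from Python 3.3 onward no
longer have the narrow/wide distinction, so len() there counts code points.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def narrow_len(text):
    """Count 16-bit code units: a simulation of len() on a narrow build."""
    return len(text.encode("utf-16-le")) // 2


high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))           # 0xd83d 0xde00
print(narrow_len("a\U0001F600"))     # 3
```

On a narrow build, slicing that string at index 2 would land between the two
surrogates, which is the trade-off described above.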

For the "wide" build, the entire range of Unicode is encoded at
4 bytes per character, and slicing/len operate correctly since
every character is the same length.  This used to be called UCS-4
and is now UTF-32.
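
A small sketch of why fixed width makes indexing simple, using the standard
"utf-32-le" codec as a stand-in for the wide build's storage (the sample
string is an assumption chosen for illustration):

```python
# Every character takes exactly 4 bytes in UTF-32, so character i
# always sits at byte offsets [4*i, 4*i + 4).
text = "a\u00e9\U0001F600"
encoded = text.encode("utf-32-le")   # little-endian, no byte-order mark

print(len(encoded))                  # 12

# Pull out the third character (index 2) directly by byte offset.
third = encoded[4 * 2:4 * 3].decode("utf-32-le")
print(third == "\U0001F600")         # True
```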

So, with "wide" builds there isn't much confusion (except perhaps
unfamiliar terminology).   The real issue seems to be that for 
"narrow" builds, none of the usual encoding names is exactly correct.  

From a user's point of view, the actual encoding or encoding name
doesn't matter much.  They just need to be able to predict the relevant
behaviors (memory consumption and len/slicing behavior).

For the narrow build, that behavior is:
- Characters in the BMP consume 2 bytes and count as one char
  for purposes of len and slicing.
- Characters above the BMP consume 4 bytes and count as
  two distinct chars for purposes of len and slicing.

For wide builds, all characters are 4 bytes and count as a single
char for len and slicing.
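
The rules above can be turned into a rough predictor.  This is only a sketch:
predicted_behavior is a hypothetical helper, it counts character payload
bytes only (no object overhead), and it assumes it runs on an interpreter
where iterating a str yields whole code points (Python 3.3 or later, or a
wide build).

```python
def predicted_behavior(text):
    """Return (len, payload bytes) per build type, following the rules above."""
    astral = sum(1 for ch in text if ord(ch) > 0xFFFF)  # chars above the BMP
    bmp = len(text) - astral                            # chars inside the BMP
    return {
        # Narrow: BMP chars are 2 bytes / 1 unit; others are 4 bytes / 2 units.
        "narrow": (bmp + 2 * astral, 2 * bmp + 4 * astral),
        # Wide: every char is 4 bytes and counts once.
        "wide": (bmp + astral, 4 * (bmp + astral)),
    }


result = predicted_behavior("a\U0001F600")
print(result["narrow"])   # (3, 6)
print(result["wide"])     # (2, 8)
```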

Hope this helps,


Raymond

