[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 12:48:13 CEST 2008


My apologies for hammering on this, but I think it is quite important and
currently Python 3.0 seems confused about UCS-2 versus UTF-16.

-On [20080702 20:47], Guido van Rossum (guido at python.org) wrote:
>No, Python already is aware of surrogates. I meant applications
>processing non-BMP text should beware of them.

Just to make sure people are fully aware of the distinctions:

UCS-2 uses 16 bits to encode Unicode data, does NOT support surrogate pairs
and therefore CANNOT represent data beyond U+FFFF (thus only supporting the
Basic Multilingual Plane, BMP). It is a fixed-length character encoding.

UTF-16 also uses 16 bits to encode Unicode data, but DOES support surrogate
pairs and therefore CAN represent data beyond U+FFFF by using said surrogate
pairs (thus supporting all planes). It is a variable-length character
encoding.

So a string representation in UCS-2 means every character occupies 16 bits.
A string representation in UTF-16 means a character can occupy either 16
bits or 32 bits (one or two 16-bit code units).
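
To make the difference concrete, here is a minimal sketch that lists the
16-bit code units UTF-16 produces for a BMP character and for a non-BMP
character. The helper name utf16_code_units is made up for this
illustration; it only relies on the standard 'utf-16-be' codec and the
struct module.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints.

    Hypothetical helper for illustration only.
    """
    raw = text.encode('utf-16-be')  # big-endian, no byte order mark
    return list(struct.unpack('>%dH' % (len(raw) // 2), raw))

bmp = '\u942a'         # inside the BMP: one code unit
astral = '\U00020045'  # outside the BMP: a surrogate pair

print([hex(u) for u in utf16_code_units(bmp)])     # ['0x942a']
print([hex(u) for u in utf16_code_units(astral)])  # ['0xd840', '0xdc45']
```

UCS-2 has no way to express the second case at all, since it has no
notion of a surrogate pair.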

If one stays within the BMP then all is well, but when you move beyond the
BMP (U+10000 - U+10FFFF) then Python needs to correctly check the string
for surrogate pairs and deal with them internally.
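
As a rough sketch of what such a check involves (the function names here
are invented for illustration and are not Python internals): high
surrogates live in 0xD800-0xDBFF, low surrogates in 0xDC00-0xDFFF, and a
high/low pair combines into one codepoint beyond the BMP.

```python
def is_high_surrogate(unit):
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF

def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into a single codepoint."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(combine_surrogates(0xD840, 0xDC45)))  # 0x20045
```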

>If you find places where the Python core or standard library is doing
>Unicode processing that would break when surrogates are present you
>should file a bug. However this does not mean that every bit of code
>that slices a string at an arbitrary point (and hence risks slicing in
>the middle of a surrogate) is incorrect -- it all depends on what is
>done next with the slice.

Basically everything but string forming or string printing seems to be
broken for surrogate pairs, from what I can tell.
Also, I think you are confused about slicing in the middle of a surrogate
pair. From a UTF-16 perspective a surrogate pair is 1 codepoint! As such Python
needs to treat it as one character/codepoint in a string, dealing with
slicing as appropriate. The way you currently describe it is that UTF-16
strings will be treated as UCS-2 when it comes to slicing and the likes.
From a UTF-16 point of view such slicing can NEVER occur unless you are bit
or byte slicing instead of character/codepoint slicing.

The documentation for len() says:
Return the length (the number of items) of an object.

I think it can be fairly said that an item in a string is a character or
codepoint. Take for example the following string:

a = '\U00020045\u942a' # Two hanzi/kanji/hanja

From a Unicode perspective we are looking at two characters/codepoints.
When we use a 4-byte Python 3.0 binary we get (as expected):
>>> len(a)
2

When we use a 2-byte Python 3.0 binary (the default) we get (not as
expected):
>>> len(a)
3

From a UTF-16 perspective a surrogate pair is one character/codepoint and
as such len() should have reported 2 as well. That the sequence is stored
internally as 0xd840 0xdc45 0x942a and occupies three 16-bit code units
(6 bytes) is not interesting.
But it seems as if len() is treating the string as being in UCS-2
(fixed-length), which is the only logical explanation for the number 3,
instead of treating it as UTF-16 (variable-length) and reporting the number
2.
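
A minimal sketch of the difference, working on the three code units
above rather than on a real 2-byte build: len() on the raw units gives
3, while a count that treats a surrogate pair as one item gives 2. The
name codepoint_len is hypothetical and is not an existing Python
function.

```python
def codepoint_len(units):
    """Count codepoints in a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate counts as one item;
    any other code unit (including a lone surrogate) counts as one.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        count += 1
    return count

units = [0xD840, 0xDC45, 0x942A]  # the string a as UTF-16 code units
print(len(units))            # 3, the UCS-2 style answer
print(codepoint_len(units))  # 2, the UTF-16 aware answer
```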

Subsequently doing a: print(a[1]) to get the 0x942a (鐪) actually requires
a[2] on the 2-byte Python 3.0. As such the code you write for 2-byte and
4-byte Python 3.0 is *different* when you have to deal with the same Unicode
strings! This cannot be the desired situation, can it?

Two more examples:

>>> a.find('鐪') # 4-byte
1
>>> a.find('鐪') # 2-byte
2

>>> import re # 4-byte
>>> m = re.search('鐪', a)
>>> m.start()
1
>>> import re # 2-byte
>>> m = re.search('鐪', a)
>>> m.start()
2
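
To illustrate how the two answers relate, here is a sketch that converts
an offset counted in UTF-16 code units (what the 2-byte build reports)
into an offset counted in codepoints (what the 4-byte build reports).
The function codepoint_offset is an assumption made for this example,
not something the core or the re module provides.

```python
def codepoint_offset(units, unit_offset):
    """Translate a code unit offset into a codepoint offset.

    units is a sequence of UTF-16 code units. A complete surrogate
    pair before unit_offset advances the result by one, not two.
    """
    offset = 0
    i = 0
    while i < unit_offset:
        is_pair = (0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        offset += 1
    return offset

units = [0xD840, 0xDC45, 0x942A]   # the string a as UTF-16 code units
unit_index = units.index(0x942A)   # 2, as on the 2-byte build
print(codepoint_offset(units, unit_index))  # 1, as on the 4-byte build
```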

This, in my opinion, has nothing to do with the application writers, but
more with Python's internals being confused about UCS-2 and UTF-16. We
accept full 32-bit codepoints with the \U escape in strings, and we may even
store it as UTF-16 internally, but we clearly do not deal with it properly
as UTF-16, but rather as UCS-2, when it comes to using said strings with
core functions and modules.

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
For wouldst thou not carve at my Soul with thine sword of Supreme Truth?

