[Python-Dev] UCS2/UCS4 default

"Martin v. Löwis" martin at v.loewis.de
Thu Jul 3 19:31:14 CEST 2008


> Basically everything but string forming or string printing seems to be
> broken for surrogate pairs, from what I can tell.

We probably disagree about what "it works correctly" means. I think
everything works correctly.

> Also, I think you are confused about slicing in the middle of a surrogate
> pair, from a UTF-16 perspective this is 1 codepoint!

Yes, but it is two code units. Python's UTF-16 implementation operates
on code units, not code points.

> And as such Python
> needs to treat it as one character/codepoint in a string, dealing with
> slicing as appropriate.

It does. However, functions such as len, and all indexing, operate in
code units, not code points.
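
To make the distinction concrete, here is a minimal sketch. It assumes a
current Python 3 (3.3 or later), where str is indexed by code point, so
the code-unit view of a UTF-16 build has to be emulated by encoding
explicitly. The helper name utf16_code_units is hypothetical, chosen only
for this illustration.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers.

    Hypothetical helper: emulates the code-unit view by encoding to
    UTF-16 (little-endian, no BOM) and reading 16-bit values.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "a\U00010400b"                 # U+10400 lies outside the BMP
units = utf16_code_units(s)

print(len(s))                      # code points: 3
print(len(units))                  # UTF-16 code units: 4
print([hex(u) for u in units])     # ['0x61', '0xd801', '0xdc00', '0x62']
```

On an implementation that stores strings as UTF-16 code units, len and
indexing would report the second view (4 items, with the surrogate pair
occupying two of them).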

> The way you currently describe it is that UTF-16
> strings will be treated as UCS-2 when it comes to slicing and the likes.

No. In UCS-2, the surrogate range is reserved (for UTF-16). In Python,
it's not reserved, but interpreted as UTF-16.
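
What "interpreted as UTF-16" means for a pair of code units in the
surrogate range can be sketched with the standard UTF-16 arithmetic. The
function names below are hypothetical and only illustrate the mapping;
they are not taken from the Python implementation.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 (high, low) surrogate pair to its code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def split_code_point(cp):
    """Map a code point outside the BMP to its (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(hex(combine_surrogates(0xD801, 0xDC00)))        # 0x10400
print([hex(u) for u in split_code_point(0x1D11E)])    # ['0xd834', '0xdd1e']
```

Under a pure UCS-2 reading, 0xD801 and 0xDC00 would simply be two
reserved values with no combined meaning.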

> From a UTF-16 point of view such slicing can NEVER occur unless you are bit
> or byte slicing instead of character/codepoint slicing.

It most certainly can. UTF-16 is not a character set, but a character
encoding form (unlike UCS-2, which is a coded character set). Slicing
*can* occur at the code unit level. UTF-16 can also be understood as a
character encoding scheme (by means of the BOM); then slicing can
occur even at the byte level.
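
Both kinds of slicing can be demonstrated on the encoded bytes. This is
only a sketch, assuming Python 3.4 or later (where, as far as I know, the
"surrogatepass" error handler is also accepted by the UTF-16 codecs); the
variable names are made up for the example.

```python
data = "a\U00010400b".encode("utf-16-le")   # 8 bytes = 4 code units

# Code-unit slice: keep the first two units, cutting the surrogate pair.
unit_slice = data[:2 * 2]
try:
    unit_slice.decode("utf-16-le")
    unit_ok = True
except UnicodeDecodeError:
    unit_ok = False          # a lone high surrogate is not valid UTF-16

# The same slice decodes once lone surrogates are explicitly allowed.
lone = unit_slice.decode("utf-16-le", "surrogatepass")

# Byte slice: cut in the middle of a code unit.
byte_slice = data[:3]
try:
    byte_slice.decode("utf-16-le")
    byte_ok = True
except UnicodeDecodeError:
    byte_ok = False          # an odd number of bytes is truncated data

print(unit_ok, byte_ok)      # False False
print(ascii(lone))           # 'a\ud801'
```

The slices themselves are perfectly possible; what fails is decoding the
result as well-formed UTF-16.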

> I think it can be fairly said that an item in a string is a character or
> codepoint.

Not in Python - it's a code unit.

Regards,
Martin