[Python-Dev] UCS2/UCS4 default
"Martin v. Löwis"
martin at v.loewis.de
Thu Jul 3 19:31:14 CEST 2008
> Basically everything but string forming or string printing seems to be
> broken for surrogate pairs, from what I can tell.
We probably disagree on what "it works correctly" means. I think everything
works correctly.
> Also, I think you are confused about slicing in the middle of a surrogate
> pair, from a UTF-16 perspective this is 1 codepoint!
Yes, but it is two code units. Python's UTF-16 implementation operates
on code units, not code points.
> And as such Python
> needs to treat it as one character/codepoint in a string, dealing with
> slicing as appropriate.
It does. However, functions such as len, and all indexing, operate in
code units, not code points.
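
As an illustration only, here is a minimal sketch of the difference between
counting code points and counting code units. It assumes a current Python 3
(3.3 or later), where str is indexed by code point, so the code-unit view of
a narrow build is emulated by encoding to UTF-16. The helper name
utf16_code_units is hypothetical, not part of any library.

```python
import struct

def utf16_code_units(s):
    """Return the 16-bit UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")  # little-endian, no BOM
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "a\U00010400b"             # U+10400 lies outside the BMP
units = utf16_code_units(s)

n_code_points = len(s)         # counts code points on Python 3.3+
n_code_units = len(units)      # counts code units, as a narrow build would
second_unit = units[1]         # indexing by code unit lands on a surrogate
```

Here the string has three code points but four code units, and index 1 in
the code-unit view is the high surrogate alone, not the whole character.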
> The way you currently describe it is that UTF-16
> strings will be treated as UCS-2 when it comes to slicing and the likes.
No. In UCS-2, the surrogate range is reserved (for UTF-16). In Python,
it's not reserved, but interpreted as UTF-16.
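
For reference, "interpreted as UTF-16" means that a high surrogate followed
by a low surrogate is read as a single code point, using the standard UTF-16
arithmetic. A minimal sketch follows; the function name combine_surrogates
is hypothetical.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair (two code units) to one code point."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: %#x" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset past the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xD801, 0xDC00)
```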
> From a UTF-16 point of view such slicing can NEVER occur unless you are bit
> or byte slicing instead of character/codepoint slicing.
It most certainly can. UTF-16 is not a character set, but a character
encoding form (unlike UCS-2, which is a coded character set). Slicing
*can* occur at the code unit level. When UTF-16 is understood as a
character encoding scheme (by means of the BOM), slicing can
occur even at the byte level.
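
Both kinds of slicing can be sketched with the standard codecs, assuming a
current Python 3. The helper decodes_cleanly is hypothetical and exists only
to show which slices are still well-formed UTF-16.

```python
def decodes_cleanly(data, encoding):
    """Return True if data is well-formed in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

ch = "\U00010400"                # one code point, two UTF-16 code units

# Encoding form: a sequence of 16-bit code units (no BOM here).
form = ch.encode("utf-16-le")    # 4 bytes = 2 code units
high_only = form[:2]             # cut between the two code units

# Encoding scheme: a byte stream, with a BOM marking the byte order.
scheme = ch.encode("utf-16")     # BOM + 2 code units = 6 bytes
mid_unit = scheme[:3]            # cut in the middle of a code unit

whole_ok = decodes_cleanly(form, "utf-16-le")
high_ok = decodes_cleanly(high_only, "utf-16-le")
mid_ok = decodes_cleanly(mid_unit, "utf-16")
```

Only the unsliced data decodes cleanly. The slice between code units leaves a
lone high surrogate, and the byte-level slice leaves half a code unit.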
> I think it can be fairly said that an item in a string is a character or
> codepoint.
Not in Python - it's a code unit.
Regards,
Martin