[Python-3000] String comparison

Tue Jun 12 16:39:48 CEST 2007

On 6/12/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depends on your definition of "practically".
>
> Python does interpret them that way to maintain O(1) positional access
> within strings encoded with 16 bits/char.

Indexing does not try to interpret the string as code points at all; it
works on code units. The difference is easier to see if you imagine Python
using utf-8 for strings. Indexing would still work on (8-bit) code units
instead of code points. It is higher-level operations such as
unicodedata.normalize() that need to interpret strings as code points.
For 16-bit code units there are two interpretations, depending on whether
you think that surrogate pairs mean one (UTF-16) or two (UCS-2) code points.
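
To make the code unit / code point distinction concrete, here is a minimal
sketch. It assumes a current Python 3, where str itself indexes by code point,
so the code units are obtained by encoding explicitly. The helper names
(utf16_code_units, as_ucs2, as_utf16) are made up for this illustration and
are not part of any library.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def as_ucs2(units):
    """UCS-2 reading: every code unit is a code point of its own."""
    return list(units)

def as_utf16(units):
    """UTF-16 reading: a high+low surrogate pair is one non-BMP code point."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine the pair into a single code point above U+FFFF.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

text = "a\U0001D11E"                    # 'a' followed by one non-BMP character
units16 = utf16_code_units(text)        # what a 16-bit build would index
units8 = list(text.encode("utf-8"))     # what a utf-8 based str would index

print(len(units8), len(units16))
print([hex(c) for c in as_ucs2(units16)])
print([hex(c) for c in as_utf16(units16)])
```

Indexing into units16 or units8 is O(1) and never looks at what the units
mean; only as_ucs2() and as_utf16() commit to an interpretation.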

Incidentally, unicodedata.normalize() is an example that currently does
interpret its input as UCS-2 instead of UTF-16. If you pass it a surrogate
pair it thinks of them as two code points, and won't do any normalization
for anything outside the BMP on a UCS-2 build. Another example would be
ord(), which raises TypeError if you pass it a surrogate pair (oddly
enough, as strings of different length are of the same type).
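
The behaviour described above can be approximated on a current Python 3 by
building the surrogate pair by hand, since str there can hold lone surrogates.
This is only a sketch of what a 16-bit build sees, not a reproduction of one.
The helper to_surrogate_pair is hypothetical, and the TypeError shown is the
one ord() raises for a string of length 2.

```python
import unicodedata

def to_surrogate_pair(ch):
    """Split one non-BMP character into its two UTF-16 surrogates."""
    cp = ord(ch) - 0x10000
    return chr(0xD800 + (cp >> 10)) + chr(0xDC00 + (cp & 0x3FF))

# U+2F800 is a CJK compatibility ideograph that normalizes to U+4E3D.
ideograph = "\U0002F800"
pair = to_surrogate_pair(ideograph)     # two code units, as on a 16-bit build

# Seen as one code point (the UTF-16 reading), normalization applies.
normalized_one = unicodedata.normalize("NFC", ideograph)

# Seen as two code points (the UCS-2 reading), neither surrogate has a
# decomposition, so the pair comes back unchanged.
normalized_two = unicodedata.normalize("NFC", pair)

# ord() wants exactly one character, so the pair is rejected.
try:
    ord(pair)
    ord_error = None
except TypeError as exc:
    ord_error = exc

print(normalized_one == "\u4e3d")
print(normalized_two == pair)
print(type(ord_error).__name__)
```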