[Python-3000] String comparison

Jim Jewett jimjjewett at gmail.com
Tue Jun 12 19:08:30 CEST 2007


On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> On 6/12/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 6/12/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:
> > > Practically speaking, there's little need to interpret
> > > surrogate pairs as two code points instead of as one
> > > non-BMP code point.

> > Depends on your definition of "practically".

> > Python does interpret them that way to maintain O(1) positional
> > access within strings encoded with 16 bits/char.

> Indexing does not try to interpret the string as code points at all, it
> works on code units.

Even assuming that (though most people will assume "letters", and could
maybe understand that accent marks sometimes count separately), it still
doesn't quite work.

Slicing (or iterating over) a string claims to return strings of the same type.

>>> for x in u"abc": print type(x)

<type 'unicode'>
<type 'unicode'>
<type 'unicode'>

Strictly speaking, the surrogate pairs should be returned together,
rather than as separate code units.  It probably won't be fixed, since
those who care most are probably using 4-byte Unicode characters.
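
To make the difference concrete, here is a rough sketch (not anything
Python itself does) of what "returned together" could look like.  It is
written for a current Python 3, where str is no longer stored as 16-bit
units, so the UTF-16 code units are simulated by encoding the string.
The helper names utf16_code_units and iter_code_points are made up for
illustration.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: simulates what a 16 bits/char build would store.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def iter_code_points(units):
    """Yield code points from 16-bit code units, joining surrogate pairs.

    Hypothetical helper: an unpaired surrogate is passed through unchanged.
    """
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # Combine the high and low surrogate into one non-BMP code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1

# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
text = "a\U0001D11Eb"
units = utf16_code_units(text)
points = list(iter_code_points(units))

print([hex(u) for u in units])   # four code units; the clef takes two
print([hex(p) for p in points])  # three code points
```

Indexing by code unit stays O(1), but it can land in the middle of the
pair; the pairing loop above has to scan, which is the trade-off being
discussed.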

-jJ

