Python's handling of unicode surrogates

Josiah Carlson josiah.carlson at gmail.com
Sun Apr 22 06:11:57 CEST 2007


On Apr 20, 7:34 pm, Rhamphoryncus <rha... at gmail.com> wrote:
> On Apr 20, 6:21 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
>
> > > I don't believe this specific variant has been discussed.
> > Now that you clarify it: no, it hasn't been discussed. I find that
> > not surprising - this proposal is so strange and unnatural that
> > probably nobody dared to suggest it.
>
> Difficult problems sometimes need unexpected solutions.
>
> Although Guido seems to be relenting slightly on the O(1) indexing
> requirement, so maybe we'll end up with an O(log n) solution (where n
> is the number of surrogates, not the length of the string).

The last thing I heard with regards to string indexing was that Guido
was very adamant about O(1) indexing.  On the other hand, if one is
willing to have O(log n) access time (where n is the number of
surrogate pairs), it can be done in O(n/logn) space (again where n is
the number of surrogate pairs).  An early version of the structure can
be found here: http://mail.python.org/pipermail/python-3000/2006-September/003937.html

I can't seem to find my later post with an updated version (I do have
the source somewhere).

> If you pick an index at random you will get IndexError.  If you
> calculate the index using some combination of len, find, index, rfind,
> rindex you will be unaffected by my change.  You can even assume the
> length of a character so long as you know it fits in 16 bits (ie any
> '\uxxxx' escape).
>
> I'm unaware of any practical use cases that would be harmed by my
> change, so that leaves only philosophical issues.  Considering the
> difficulty of the problem it seems like an okay trade-off to me.

It is not ok for s[i] to fail for any index within the range of the
length of the string.

 - Josiah




More information about the Python-list mailing list