Python's handling of unicode surrogates
josiah.carlson at gmail.com
Sun Apr 22 06:11:57 CEST 2007
On Apr 20, 7:34 pm, Rhamphoryncus <rha... at gmail.com> wrote:
> On Apr 20, 6:21 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > > I don't believe this specific variant has been discussed.
> > Now that you clarify it: no, it hasn't been discussed. I find that
> > not surprising - this proposal is so strange and unnatural that
> > probably nobody dared to suggest it.
> Difficult problems sometimes need unexpected solutions.
> Although Guido seems to be relenting slightly on the O(1) indexing
> requirement, so maybe we'll end up with an O(log n) solution (where n
> is the number of surrogates, not the length of the string).
The last thing I heard with regards to string indexing was that Guido
was very adamant about O(1) indexing. On the other hand, if one is
willing to have O(log n) access time (where n is the number of
surrogate pairs), it can be done in O(n/logn) space (again where n is
the number of surrogate pairs). An early version of the structure can
be found here: http://mail.python.org/pipermail/python-3000/2006-September/003937.html
I can't seem to find my later post with an updated version (I do have
the source somewhere).
> If you pick an index at random you will get IndexError. If you
> calculate the index using some combination of len, find, index, rfind,
> rindex you will be unaffected by my change. You can even assume the
> length of a character so long as you know it fits in 16 bits (ie any
> '\uxxxx' escape).
> I'm unaware of any practical use cases that would be harmed by my
> change, so that leaves only philosophical issues. Considering the
> difficulty of the problem it seems like an okay trade-off to me.
It is not ok for s[i] to fail for any index within the range of the
length of the string.
More information about the Python-list