[Python-3000] How will unicode get used?

Josiah Carlson jcarlson at uci.edu
Thu Sep 28 01:32:33 CEST 2006


"Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson schrieb:
> > What about a tree structure over the top of the string as I described in
> > another post?  If there are no surrogate pairs, the pointer to the tree
> > is null.  If there are surrogate pairs, we could either use the
> > structure as I described, or even modify it so that we get even better
> > memory utilization/performance (choose tree nodes based on where
> > surrogate pairs are, up to some limit).
> 
> As always, it's a time-vs-space tradeoff. People tend to resolve these
> in favor of time, accepting an increase in space. I'm not so sure this
> is always the right answer. In the specific case, I'm also worried about
> the increase in complexness.
> 
> That said, it is always good to have a prototype implementation to
> analyse the consequences better.

I'm away from my main machine at the moment, so I am unable to test my
implementation, but I do have a sample.

There are two main functions to this implementation.  One which
constructs a tree for O(log n) worst-case access to character addresses,
and one which traverses the tree to discover the character address.  For
strings without surrogates, it's O(1) character address discovery.  The
implementation of surrogate discovery is very simple, using section 3.8
and 5.4 in the Unicode 4.0 standard.

If there are no surrogates, it takes a single pass over the input, and
constructs a single node (12 or 24 bytes, depending on the build, need
to replace long with Py_ssize_t).  If there are surrogates, it creates a
block of nodes, adjusts pointers to create a tree, and returns a pointer
to the root.  The tree will have at most O(n/logn) nodes, though it will
tend to create long blocks of non-surrogates, so that if you have a
single surrogate in the middle of a huge string, it will be conceptually
viewed as 3 blocks.


Attached is my untested sample implementation (I'm away for the next
week or so, and can't test), that should give an idea of what I was
talking about.

 - Josiah
-------------- next part --------------
A non-text attachment was scrubbed...
Name: surrogate_tree.c
Type: application/octet-stream
Size: 4688 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060927/8a293362/attachment.obj 


More information about the Python-3000 mailing list