[pypy-dev] PyPy 2 unicode class

Oscar Benjamin oscar.j.benjamin at gmail.com
Thu Jan 23 14:27:50 CET 2014


On Wed, Jan 22, 2014 at 06:56:32PM +0100, Armin Rigo wrote:
> Hi Johan,
> 
> On Wed, Jan 22, 2014 at 8:01 AM, Johan Råde <johan.rade at gmail.com> wrote:
> > (I hope this makes more sense than my ramblings on IRC last night.)
> 
> All versions you gave make sense as far as I'm concerned :-)  But this
> last one is the clearest indeed.
> 
> It seems that Python 3 went that way anyway too, and exposes the same
> "natural" interface on all platforms including Windows.
> 
> I'm not saying it's a strong argument for us doing the same thing in
> PyPy *2*, but it's certainly an extra argument for it.  I'd be
> prepared to say that no portable program should crash because it runs
> on the "wide" version of Python, even if Windows-only programs are not
> portable in that sense.  The argument that it might actually *fix*
> more programs than it breaks is interesting too.

The distinction between narrow and wide builds of CPython is just a bug in
CPython. Thankfully it is fixed as of CPython 3.3. PyPy should definitely not
try to emulate that behaviour.

> 
> As far as I'm concerned I'd vote for going with it tentatively (i.e.
> implementing unicodes as utf-8 strings internally, with an indexing
> cache).  If it's really needed we can always add another layer on top
> of the implementation to give the utf-16 image of the world again.
> Anyway, it's a trade-off between raw performance (how can the indexing
> cache be made fast?) and memory usage, with a side note on RPython
> getting rid of the not-really-reusable unicode type.

Steven wrote:
> With a UTF-8 implementation, won't that mean that string indexing
> operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
> byte(s) to look at to get the character at index 42 without having to
> walk the string from the start?

With an indexing cache you do this by looking it up in the cache. If you cache
the offset of the starting byte for every kth character then indexing can be
O(k). Building the index is O(N) where N is the length of the string. If you
do this on demand (the first time an index operation is attempted) then you'll
often not need to do it. Subsequent indexing operations are O(k) where k is a
controllable parameter. So there's an O(N) per-string cost if the string is
indexed at least once, and an O(k) cost per attempt to index into the string.
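The scheme above can be sketched in plain Python. This is only an illustration of the idea, not PyPy's actual implementation; the class name `Utf8String` and the stride `K` are made up for the example. The cache stores the byte offset of every K-th character, so `__getitem__` jumps to the nearest cached offset and scans forward at most K-1 characters:

```python
K = 4  # cache a byte offset every K characters (the tunable parameter k)

class Utf8String:
    """Toy string type holding UTF-8 bytes with a lazy indexing cache."""

    def __init__(self, data: bytes):
        self._data = data       # UTF-8 encoded bytes
        self._offsets = None    # built lazily, on the first indexing attempt

    def _build_index(self):
        # O(N) scan: record the byte offset of every K-th character start.
        # A byte starts a character iff it is not a continuation byte (10xxxxxx).
        offsets = []
        char_count = 0
        for pos, byte in enumerate(self._data):
            if byte & 0xC0 != 0x80:
                if char_count % K == 0:
                    offsets.append(pos)
                char_count += 1
        self._offsets = offsets
        self._length = char_count

    def __getitem__(self, i: int) -> str:
        if self._offsets is None:
            self._build_index()
        if not 0 <= i < self._length:
            raise IndexError(i)
        # Jump to the nearest cached offset, then scan at most K-1 chars: O(K).
        pos = self._offsets[i // K]
        remaining = i % K
        while remaining:
            pos += 1
            if self._data[pos] & 0xC0 != 0x80:
                remaining -= 1
        # Advance past any continuation bytes of the character at pos.
        end = pos + 1
        while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
            end += 1
        return self._data[pos:end].decode('utf-8')
```

Smaller K means faster indexing but a bigger cache; K around a cache-line's worth of characters is the kind of trade-off being discussed.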

OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes
O(1) instead of O(N), and uses less memory: if you do s = b.decode('utf-8')
then s and b can share the same memory buffer. The FSR (CPython 3.3's Flexible
String Representation, PEP 393) does this for latin-1 strings, but utf-8
strings are more useful. So if you're more likely to decode/encode from/to
utf-8 than you are to index into a long string, that's a big saving. If the
string comes from anything other than utf-8, the indexing cache can be built
while decoding (and re-encoding as utf-8 under the hood).
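To make the "decode is O(1) apart from validation" point concrete, here is a hedged sketch of what a zero-copy utf-8 decode could look like: a single O(N) pass that validates the bytes and fills the every-k-th-character offset cache at the same time, returning the original buffer unchanged. The function name `decode_utf8` and the returned tuple shape are invented for the example (and it skips some real-world checks such as overlong encodings and surrogates):

```python
def decode_utf8(data: bytes, k: int = 4):
    """Validate UTF-8 and build the indexing cache in one O(N) pass.

    Returns (buffer, offsets, char_count); the buffer is the input object
    itself, i.e. no copy is made.
    """
    offsets = []
    char_count = 0
    pos = 0
    n = len(data)
    while pos < n:
        byte = data[pos]
        if byte < 0x80:                 # 0xxxxxxx: ASCII
            width = 1
        elif byte >> 5 == 0b110:        # 110xxxxx: 2-byte sequence
            width = 2
        elif byte >> 4 == 0b1110:       # 1110xxxx: 3-byte sequence
            width = 3
        elif byte >> 3 == 0b11110:      # 11110xxx: 4-byte sequence
            width = 4
        else:
            raise UnicodeDecodeError('utf-8', data, pos, pos + 1,
                                     'invalid start byte')
        if (pos + width > n or
                any(data[pos + j] & 0xC0 != 0x80 for j in range(1, width))):
            raise UnicodeDecodeError('utf-8', data, pos, pos + width,
                                     'truncated sequence')
        if char_count % k == 0:
            offsets.append(pos)         # cache every k-th character start
        char_count += 1
        pos += width
    return data, offsets, char_count    # same buffer: no O(N) copy
```

For input in any other encoding, the same pass would run while transcoding to utf-8, so the cache comes for free there too.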


Oscar

