[pypy-dev] PyPy 2 unicode class

Oscar Benjamin oscar.j.benjamin at gmail.com
Fri Jan 24 02:25:12 CET 2014


On 23 January 2014 20:54, Steven D'Aprano <steve at pearwood.info> wrote:
> On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
>
>> Steven wrote:
>> > With a UTF-8 implementation, won't that mean that string indexing
>> > operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
>> > byte(s) to look at to get the character at index 42 without having to
>> > walk the string from the start?
>>
>> With an indexing cache you do this by looking it up in the cache. [...]
>> So there's a O(N) per-string cost if the string is
>> indexed at least once and an O(k) cost per attempt to index into the string.
>
> Plus a memory cost for the cache, and increased complexity for indexing
> operations.

There is an O(n/k) memory cost for the cache, yes, and indexing has an
O(k) CPU cost.
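
To make that concrete, here's a rough pure-Python sketch of the kind
of cache I mean (the names and the stride k=32 are my own choices; a
real version would live at the interpreter level, not in pure Python):

    def _char_width(lead):
        # the lead byte of a utf-8 sequence determines how many
        # bytes the encoded character occupies
        if lead < 0x80:
            return 1
        elif lead < 0xE0:
            return 2
        elif lead < 0xF0:
            return 3
        return 4

    class Utf8IndexCache(object):
        """Remember the byte offset of every k-th character."""

        def __init__(self, data, k=32):
            self.data = bytearray(data)   # utf-8 encoded bytes
            self.k = k
            self.offsets = []   # offsets[j] = byte offset of char j*k
            pos = char = 0
            while pos < len(self.data):
                if char % k == 0:
                    self.offsets.append(pos)
                pos += _char_width(self.data[pos])
                char += 1
            self.length = char  # character count, computed once

        def byte_offset(self, i):
            # O(k): jump to the nearest cached offset, then walk
            # at most k-1 characters forward
            pos = self.offsets[i // self.k]
            for _ in range(i % self.k):
                pos += _char_width(self.data[pos])
            return pos

    cache = Utf8IndexCache(u'\u0430\u0431c'.encode('utf-8'))
    assert cache.byte_offset(2) == 4  # two 2-byte chars precede 'c'

Building the offsets list is the O(n) per-string cost; storing one
offset per k characters is the O(n/k) memory cost.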

>> OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes
>> O(1) instead of O(N) and uses less memory since if you do s=b.decode('utf-8')
>> then s and b can share the same memory buffer.
>
> CPython caches the UTF-8 encoding of (some? all?) strings too. It's not
> clear to me if they are cached lazily or eagerly, but either way, I'm
> not convinced by this argument that this is useful.

There's a difference between caching the utf-8 string and reusing it.
If you cache the utf-8 string you're holding two memory buffers: one
in FSR format and one in utf-8 format. If your implementation uses
utf-8 internally, then you can use one buffer for both the str and the
corresponding bytes object.
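
Here's a hypothetical sketch of that sharing (not real PyPy code,
just the shape of the idea; Utf8Str is a name I made up):

    class Utf8Str(object):
        def __init__(self, buf):
            # stands in for an O(n) validity check; a real
            # implementation might fuse or skip this pass
            buf.decode('utf-8')
            self.buf = buf   # no copy: str and bytes share storage

    b = u'\u0436\u0438'.encode('utf-8')
    s = Utf8Str(b)
    assert s.buf is b        # one buffer serving both objects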

> In most applications, you'll decode from UTF-8 exactly once, when you
> read the string in (from a file, or the network, say), and you'll encode
> it at most once, when you write it out again. Many strings won't ever be
> encoded to UTF-8.

That's true, but the notion of encoding/decoding is something of a
misnomer at the implementation level: all you ever really do is
re-encode from one representation to another. So why not choose the
most common input/output format as the standard internal format, if
that saves conversion time?

> I have difficulty imagining any application which will
> need to repeatedly encode the same string to UTF-8, where the cost
> of encoding is more significant than the cost of I/O.

It's not about repeatedly encoding the same string. It's about the
fact that you always need to encode or decode the string at least
once. Given that, minimising the cost of that one guaranteed
encode/decode operation is a good thing.

<snip>
>
>> The FSR does this for latin-1
>> strings but utf-8 strings are more useful. So if you're more likely to
>> decode/encode from/to utf-8 than you are to index into a long string that's a
>> big saving.
>
> But I don't believe that is the case for any realistic use-case. Just
> about every string operation involves walking the string, which
> needs to walk the string in characters (code points), not UTF-8 bytes.
> At the Python layer, string indexing might be relatively uncommon, but
> at the implementation layer, I think every string operation relies on
> it.

Okay, so here's a review of the Python str methods, broken down into
the relevant categories.

The first group is all the methods that can operate on utf-8 strings
by treating them as arbitrary byte strings. For example, concatenating
two utf-8 strings is equivalent to concatenating the corresponding
abstract unicode character strings (demonstrated in the snippet after
the list). I assume here that the number of characters (not bytes) in
the string is known up front (needed by e.g. ljust):

join, ljust, partition, replace, rpartition, rjust, count, endswith,
expandtabs, format, format_map, startswith, zfill, __add__,
__contains__, __eq__, __ge__, __gt__, __hash__, __le__, __len__,
__lt__, __mod__, __mul__, __ne__, __rmul__, __rmod__
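
For instance, with concatenation and count (concatenation works
byte-wise trivially; count works because utf-8 is self-synchronising,
so one character's encoding can never appear inside another's):

    s1, s2 = u'\u0430\u0431', u'cd'
    b1, b2 = s1.encode('utf-8'), s2.encode('utf-8')
    assert b1 + b2 == (s1 + s2).encode('utf-8')
    assert (b1 + b2).count(u'\u0431'.encode('utf-8')) == \
           (s1 + s2).count(u'\u0431')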

The second group is those methods that must understand that the
string is encoded in utf-8 but can walk through it sequentially,
without needing to jump to an index in the middle of the string (see
the sketch after the list):

isalnum, isalpha, isdecimal, isdigit, isidentifier, islower,
isnumeric, isprintable, isspace, istitle, isupper, lstrip, lower,
maketrans, casefold, center, encode, rsplit, rstrip, split,
splitlines, strip, swapcase, title, translate, upper, __format__,
__iter__
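
For example, a character iterator only ever needs the width encoded
in each lead byte (a sketch; iter_chars is my own name for it):

    def iter_chars(buf):
        data = bytearray(buf)   # utf-8 encoded bytes
        pos = 0
        while pos < len(data):
            lead = data[pos]
            # width of this character, from its utf-8 lead byte
            width = (1 if lead < 0x80 else 2 if lead < 0xE0
                     else 3 if lead < 0xF0 else 4)
            yield bytes(data[pos:pos + width]).decode('utf-8')
            pos += width

    chars = list(iter_chars(u'a\u00df\u2603'.encode('utf-8')))
    assert chars == [u'a', u'\u00df', u'\u2603']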

Finally, here are the methods that need to be able to jump to an
arbitrary character index within the string (sketch below):

find, index, rfind, rindex, __getitem__
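
These are the only ones that would touch the index cache. On top of
the Utf8IndexCache sketch from earlier in this mail, __getitem__
might look something like:

    def getitem(cache, i):
        # O(k) offset lookup, then decode a single character
        pos = cache.byte_offset(i)
        lead = cache.data[pos]
        width = (1 if lead < 0x80 else 2 if lead < 0xE0
                 else 3 if lead < 0xF0 else 4)
        return bytes(cache.data[pos:pos + width]).decode('utf-8')

    cache = Utf8IndexCache(u'\u0430b\u0432'.encode('utf-8'))
    assert getitem(cache, 2) == u'\u0432'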

In my own code I think I would most commonly use (in)equality,
concatenation, interpolation and formatting rather than indexing or
slicing (__getitem__). None of these operations requires the index
cache.

> But having said all this, I know that using UTF-8 internally for strings
> is quite common (e.g. Haskell does it, without even an index cache, and
> documents that indexing operations can be slow).

I find that Haskell is generally a bit slow with strings, utf-8 or
otherwise. Haskell naturally lends itself to stream-style processing,
so it's understandable that they would see efficient indexing as less
important. Using linked lists for strings is probably never going to
be as efficient as a well-implemented atomic string type.

> CPython's FSR has
> received much (in my opinion, entirely uninformed) criticism from one
> vocal person in particular for not using UTF-8 internally.

I feel like you're putting a jinx on this list. Please don't encourage
the "vocal person" to come here!


Oscar

