PyPy 2 unicode class
At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode class. He gave two versions of the design:

A: unicode with a UTF-8 implementation and a UTF-32 interface.

B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows and a UTF-32 interface on UNIX-like systems.

(Armin claims that this design can be implemented efficiently.)

Armin asked me (a Windows developer) which design I prefer. My answer is A.

Why? Short answer: Because surrogate pairs are a pain in the ...

------- Long answer --------

** A well-designed interface should accurately model the problem domain. **

In this case the problem domain contains the following abstract objects:
1. Unicode code points (i.e. integers in [0, 0x110000))
2. Unicode characters
3. Unicode strings

The relations between these abstract objects are:
* There is a 1-1 correspondence between Unicode code points and Unicode characters.
* Unicode strings are abstract sequences of Unicode characters.

With a UTF-32 interface there is a 1-1 correspondence between these abstract objects and the following concrete Python 2 objects:
1'. int objects n with 0 <= n < 0x110000
2'. unicode objects s with len(s) == 1
3'. unicode objects

With a UTF-16 interface there is not a 1-1 correspondence between 2 and 2'. Then we are not modelling the problem domain any more, and we have to deal with nonsense such as:
>>> s = u'\N{PHAISTOS DISC SIGN FLUTE}'
>>> len(s)
2
Next, would such a change break any existing Python 2 code on Windows? Yes it will. For instance the following code for counting characters in a string:

f = [0] * (1 << 16)
for c in s:
    f[ord(c)] += 1

On the other hand, some broken code will be fixed by this change. There is a lot of Windows code that does not handle surrogate pairs correctly. For instance, in an earlier version of Notepad, you had to press Delete twice to delete a character with Unicode code point >= 0x10000. Such broken code might be fixed by this change. I believe that fixing broken code will be more common than breaking working code, so the overall effect on existing code will be beneficial.

----------------------

Finally, I very much dislike the term UTF-32 interface. You can, and should, explain that interface without mentioning encodings at all; just describe the problem domain and how the interface models the problem domain. Encodings should be an implementation detail, and should not leak through the interface at all. Python should make you think in terms of abstractions, not in terms of bytes. Python is not C.

Let's call it the natural interface.

(I hope this makes more sense than my ramblings on IRC last night.)
Hi Johan,

On Wed, Jan 22, 2014 at 8:01 AM, Johan Råde <johan.rade@gmail.com> wrote:
(I hope this makes more sense than my ramblings on IRC last night.)
All versions you gave make sense as far as I'm concerned :-) But this last one is the clearest indeed.

It seems that Python 3 went that way anyway too, and exposes the same "natural" interface on all platforms including Windows.

I'm not saying it's a strong argument for us doing the same thing in PyPy *2*, but it's certainly an extra argument for it. I'd be prepared to say that no portable program should crash because it runs on the "wide" version of Python, even if Windows-only programs are not portable in that sense. The argument that it might actually *fix* more programs than it breaks is interesting too.

As far as I'm concerned I'd vote for going with it tentatively (i.e. implementing unicodes as utf-8 strings internally, with an indexing cache). If it's really needed we can always add another layer on top of the implementation to give the utf-16 image of the world again. Anyway, it's a trade-off between raw performance (how can the indexing cache be made fast?) and memory usage, with a side note on RPython getting rid of the not-really-reusable unicode type.

A bientôt,

Armin.
On Wed, Jan 22, 2014 at 06:56:32PM +0100, Armin Rigo wrote:
Hi Johan,
On Wed, Jan 22, 2014 at 8:01 AM, Johan Råde <johan.rade@gmail.com> wrote:
(I hope this makes more sense than my ramblings on IRC last night.)
All versions you gave make sense as far as I'm concerned :-) But this last one is the clearest indeed.
It seems that Python 3 went that way anyway too, and exposes the same "natural" interface on all platforms including Windows.
I'm not saying it's a strong argument for us doing the same thing in PyPy *2*, but it's certainly an extra argument for it. I'd be prepared to say that no portable program should crash because it runs on the "wide" version of Python, even if Windows-only programs are not portable in that sense. The argument that it might actually *fix* more programs than it breaks is interesting too.
The distinction between narrow and wide builds of CPython is just a bug in CPython. Thankfully it is fixed as of CPython 3.3. PyPy should definitely not try to emulate that behaviour.
As far as I'm concerned I'd vote for going with it tentatively (i.e. implementing unicodes as utf-8 strings internally, with an indexing cache). If it's really needed we can always add another layer on top of the implementation to give the utf-16 image of the world again. Anyway, it's a trade-off between raw performance (how can the indexing cache be made fast?) and memory usage, with a side note on RPython getting rid of the not-really-reusable unicode type.
Steven wrote:
With a UTF-8 implementation, won't that mean that string indexing operations are O(N) rather than O(1)? E.g. how do you know which UTF-8 byte(s) to look at to get the character at index 42 without having to walk the string from the start?
With an indexing cache you do this by looking it up in the cache.

If you cache the offset of the starting byte for every kth character then indexing can be O(k). Building the index is O(N) where N is the length of the string. If you do this on demand (the first time an index operation is attempted) then you'll often not need to do it. Subsequent indexing operations are O(k) where k is a controllable parameter.

So there's an O(N) per-string cost if the string is indexed at least once and an O(k) cost per attempt to index into the string.

OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes O(1) instead of O(N) and uses less memory, since if you do s = b.decode('utf-8') then s and b can share the same memory buffer. The FSR does this for latin-1 strings but utf-8 strings are more useful. So if you're more likely to decode/encode from/to utf-8 than you are to index into a long string that's a big saving.

If the string comes from anything other than utf-8 the indexing cache can be built while decoding (and re-encoding as utf-8 under the hood).

Oscar
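[Editor's note: to make the scheme above concrete, here is a minimal sketch of such an indexing cache, written as ordinary Python 3 for clarity. The class name Utf8String and the parameter k are made up for illustration; this is not PyPy's actual implementation. It stores the byte offset of every k-th character, builds the index lazily on the first indexing operation, and resolves an index by jumping to the nearest cached offset and walking at most k-1 characters.]

class Utf8String(object):
    """Sketch: a unicode string stored as UTF-8 bytes plus a sparse index."""

    def __init__(self, data, k=64):
        self._data = data      # assumed to be well-formed UTF-8 bytes
        self._k = k            # cache one byte offset per k characters
        self._offsets = None   # built lazily, on the first indexing operation

    def _build_index(self):
        # One O(N) pass: a byte whose top two bits are not '10' starts a new
        # character; remember the byte offset of every k-th character.
        offsets = []
        char_no = 0
        for byte_no, byte in enumerate(self._data):
            if byte & 0xC0 != 0x80:
                if char_no % self._k == 0:
                    offsets.append(byte_no)
                char_no += 1
        self._offsets = offsets
        self._len = char_no

    def __len__(self):
        if self._offsets is None:
            self._build_index()
        return self._len

    def __getitem__(self, i):
        if self._offsets is None:
            self._build_index()
        if not 0 <= i < self._len:
            raise IndexError(i)
        # Jump to the nearest cached offset, then walk forward at most k-1
        # characters by skipping over continuation bytes.
        pos = self._offsets[i // self._k]
        for _ in range(i % self._k):
            pos += 1
            while self._data[pos] & 0xC0 == 0x80:
                pos += 1
        end = pos + 1
        while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
            end += 1
        return self._data[pos:end].decode('utf-8')

# Usage sketch: Utf8String(u'héllo wörld'.encode('utf-8'))[7] == u'ö'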
Hi Oscar,

Thanks for explaining the caching in detail :-)

On Thu, Jan 23, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
big saving. If the string comes from anything other than utf-8 the indexing cache can be built while decoding (and reencoding as utf-8 under the hood).
Actually, you need to walk the string even to do "u = s.decode('utf-8')". The reason is that you need to check if the byte string is well-formed UTF-8 or not. So we can build the cache eagerly in all cases, it seems.

A bientôt,

Armin.
On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
Steven wrote:
With a UTF-8 implementation, won't that mean that string indexing operations are O(N) rather than O(1)? E.g. how do you know which UTF-8 byte(s) to look at to get the character at index 42 without having to walk the string from the start?
With an indexing cache you do this by looking it up in the cache. [...] So there's a O(N) per-string cost if the string is indexed at least once and an O(k) cost per attempt to index into the string.
Plus a memory cost for the cache, and increased complexity for indexing operations.
OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes O(1) instead of O(N) and uses less memory since if you do s=b.decode('utf-8') then s and b can share the same memory buffer.
CPython caches the UTF-8 encoding of (some? all?) strings too. It's not clear to me if they are cached lazily or eagerly, but either way, I'm not convinced by this argument that this is useful.

In most applications, you'll decode from UTF-8 exactly once, when you read the string in (from a file, or the network, say), and you'll encode it at most once, when you write it out again. Many strings won't ever be encoded to UTF-8. I have difficulty imagining any application which will need to repeatedly encode the same string to UTF-8, where the cost of encoding is more significant than the cost of I/O.

One possible exception might be printing to the terminal, say at the REPL, but even then, I don't imagine the time savings are worthwhile. I haven't noticed the REPL in Python 3.3 being particularly more responsive than that in 3.2. If there's a speed-up, it's invisible to the user, but at the cost of using extra memory for the UTF-8 cache.
The FSR does this for latin-1 strings but utf-8 strings are more useful. So if you're more likely to decode/encode from/to utf-8 than you are to index into a long string that's a big saving.
But I don't believe that is the case for any realistic use-case. Just about every string operation involves walking the string, which needs to walk the string in characters (code points), not UTF-8 bytes. At the Python layer, string indexing might be relatively uncommon, but at the implementation layer, I think every string operation relies on it.

But having said all this, I know that using UTF-8 internally for strings is quite common (e.g. Haskell does it, without even an index cache, and documents that indexing operations can be slow). CPython's FSR has received much (in my opinion, entirely uninformed) criticism from one vocal person in particular for not using UTF-8 internally. If PyPy goes ahead with using UTF-8 internally, I look forward to comparing memory and time benchmarks of string operations between CPython and PyPy.

--
Steven
On 23/1/2014 10:54 μμ, Steven D'Aprano wrote:
On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
Steven wrote:
With a UTF-8 implementation, won't that mean that string indexing operations are O(N) rather than O(1)? E.g. how do you know which UTF-8 byte(s) to look at to get the character at index 42 without having to walk the string from the start?
With an indexing cache you do this by looking it up in the cache. [...] So there's a O(N) per-string cost if the string is indexed at least once and an O(k) cost per attempt to index into the string.
Plus a memory cost for the cache, and increased complexity for indexing operations.
OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes O(1) instead of O(N) and uses less memory since if you do s=b.decode('utf-8') then s and b can share the same memory buffer.
CPython caches the UTF-8 encoding of (some? all?) strings too. It's not clear to me if they are cached lazily or eagerly, but either way, I'm not convinced by this argument that this is useful.
In most applications, you'll decode from UTF-8 exactly once, when you read the string in (from a file, or the network, say), and you'll encode it at most once, when you write it out again. Many strings won't ever be encoded to UTF-8. I have difficulty imagining any application which will need to repeatedly encode the same string to UTF-8, where the cost of encoding is more significant than the cost of I/O. One possible exception might be printing to the terminal, say at the REPL, but even then, I don't imagine the time savings are worthwhile. I haven't noticed the REPL in Python 3.3 being particularly more responsive than that in 3.2. If there's a speed-up, it's invisible to the user, but at the cost of using extra memory for the UTF-8 cache.
The FSR does this for latin-1 strings but utf-8 strings are more useful. So if you're more likely to decode/encode from/to utf-8 than you are to index into a long string that's a big saving.
But I don't believe that is the case for any realistic use-case. Just about every string operation involves walking the string, which needs to walk the string in characters (code points), not UTF-8 bytes. At the Python layer, string indexing might be relatively uncommon, but at the implementation layer, I think every string operation relies on it.
But having said all this, I know that using UTF-8 internally for strings is quite common (e.g. Haskell does it, without even an index cache, and documents that indexing operations can be slow). CPython's FSR has received much (in my opinion, entirely uninformed) criticism from one vocal person in particular for not using UTF-8 internally. If PyPy goes ahead with using UTF-8 internally, I look forward to comparing memory and time benchmarks of string operations between CPython and PyPy.
I have to admit that due to my work (databases and data processing), I'm biased towards I/O (UTF-8 is better due to size) rather than CPU.

At least from my use cases, the most frequent operations that I do on strings are read, write, store, use them as keys in dicts, concatenate and split.

For most of the above things (with the exception of split maybe?), an index cache would not be needed, and UTF-8 due to its smaller size would be faster than wide unicode encodings.

l.
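[Editor's note: as a rough illustration of the size argument (my own example, not from the original mail), the same mostly-ASCII text with a few non-Latin characters costs far fewer bytes as UTF-8 than in a fixed four-bytes-per-character layout:]

text = u"SELECT name FROM users WHERE city = '\u0391\u03b8\u03ae\u03bd\u03b1'"
utf8_bytes = len(text.encode('utf-8'))   # 1 byte per ASCII char, 2 per Greek char
ucs4_bytes = 4 * len(text)               # fixed-width wide representation
print(utf8_bytes, ucs4_bytes)            # utf-8 comes out several times smaller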
On Thu, Jan 23, 2014 at 10:45:25PM +0200, Elefterios Stamatogiannakis wrote:
But having said all this, I know that using UTF-8 internally for strings is quite common (e.g. Haskell does it, without even an index cache, and documents that indexing operations can be slow). CPython's FSR has received much (in my opinion, entirely uninformed) criticism from one vocal person in particular for not using UTF-8 internally. If PyPy goes ahead with using UTF-8 internally, I look forward to comparing memory and time benchmarks of string operations between CPython and PyPy.
I have to admit that due to my work (databases and data processing), i'm biased towards I/O (UTF-8 is better due to size) rather than CPU.
At least from my use cases, the most frequent operations that i do on strings are read, write, store, use them as keys in dicts, concatenate and split.
For most of above things (with the exception of split maybe?), an index cache would not be needed, and UTF-8 due to its smaller size would be faster than wide unicode encodings.
I hear Steven's points, but my experience matches Elefterios' - smaller data is faster [1].

I'll also note that although many string processing algorithms can be written in terms of indexing, many (most?) are actually stream processing algorithms which do not actually need efficient character offset to/from byte offset calculations. For example, split works by walking the entire string in a single pass, outputting substrings as it goes.

regards,
njh

[1] which suggests that lz77ing longer strings by default is not a terrible idea.
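[Editor's note: a minimal illustration of the stream-processing point (my sketch in plain Python, not PyPy internals): splitting the UTF-8 bytes on an ASCII delimiter yields exactly the pieces of the decoded string, so no character/byte offset translation is ever needed.]

text = u'\u03b1,\u03b2,\u03b3'      # 'α,β,γ'
data = text.encode('utf-8')
assert [p.decode('utf-8') for p in data.split(b',')] == text.split(u',')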
On 23 January 2014 20:54, Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
Steven wrote:
With a UTF-8 implementation, won't that mean that string indexing operations are O(N) rather than O(1)? E.g. how do you know which UTF-8 byte(s) to look at to get the character at index 42 without having to walk the string from the start?
With an indexing cache you do this by looking it up in the cache. [...] So there's a O(N) per-string cost if the string is indexed at least once and an O(k) cost per attempt to index into the string.
Plus a memory cost for the cache, and increased complexity for indexing operations.
There is an O(n/k) memory cost for the cache, yes, and indexing has an O(k) CPU cost.
OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes O(1) instead of O(N) and uses less memory since if you do s=b.decode('utf-8') then s and b can share the same memory buffer.
CPython caches the UTF-8 encoding of (some? all?) strings too. It's not clear to me if they are cached lazily or eagerly, but either way, I'm not convinced by this argument that this is useful.
There's a difference between caching the utf-8 string and reusing it. If you cache the utf-8 string you're holding two memory buffers: one in FSR format and one in utf-8 format. If your implementation uses utf-8 then you can use one buffer for both the str and the corresponding bytes object.
In most applications, you'll decode from UTF-8 exactly once, when you read the string in (from a file, or the network, say), and you'll encode it at most once, when you write it out again. Many strings won't ever be encoded to UTF-8.
That's true but actually the concept of encoding/decoding is a conceptual error at the implementation level. Really all you ever do is re-encode. So why not choose the most common input/output format as the standard internal format if it saves conversion time?
I have difficulty imagining any application which will need to repeatedly encode the same string to UTF-8, where the cost of encoding is more significant than the cost of I/O.
It's not about repeatedly encoding the same string. It's about the fact that you always need to encode or decode the string at least once. If that's the case then minimising the cost of that one guaranteed encode/decode operation is a good thing. <snip>
The FSR does this for latin-1 strings but utf-8 strings are more useful. So if you're more likely to decode/encode from/to utf-8 than you are to index into a long string that's a big saving.
But I don't believe that is the case for any realistic use-case. Just about every string operation involves walking the string, which needs to walk the string in characters (code points), not UTF-8 bytes. At the Python layer, string indexing might be relatively uncommon, but at the implementation layer, I think every string operation relies on it.
Okay, so here's a review of the Python str methods broken down into relevant categories.

The first is all the methods that can operate on utf-8 strings by treating them as arbitrary byte strings. For example concatenating two utf-8 strings is equivalent to concatenating the corresponding abstract unicode character strings. I make the assumption here that the number of characters (not bytes) in the string is known (e.g. for ljust):

join, ljust, partition, replace, rpartition, rjust, count, endswith, expandtabs, format, format_map, startswith, zfill, __add__, __contains__, __eq__, __ge__, __gt__, __hash__, __le__, __len__, __lt__, __mod__, __mul__, __ne__, __rmul__, __rmod__

The second group is those methods that must understand that the string is encoded in utf-8 but can walk through it without needing to jump to an index in the middle of the string:

isalnum, isalpha, isdecimal, isdigit, isidentifier, islower, isnumeric, isprintable, isspace, istitle, isupper, lstrip, lower, maketrans, casefold, center, encode, rsplit, rstrip, split, splitlines, strip, swapcase, title, translate, upper, __format__, __iter__

Finally, here are the methods that need to be able to jump to an arbitrary character index within the string:

find, index, rfind, rindex, __getitem__

In my own use I think that I would most commonly use (in)equality, concatenation, interpolation and formatting rather than indexing or slicing (__getitem__). None of these operations requires using the index cache.
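[Editor's note: a quick spot-check of the first category (my own illustration in plain Python, not PyPy code): byte-level concatenation and containment on well-formed UTF-8 agree with the corresponding character-level operations, so no decoding or index cache is involved.]

a = u'caf\xe9'.encode('utf-8')       # 'café'
b = u'\u65e5\u672c'.encode('utf-8')  # '日本'
assert (a + b).decode('utf-8') == a.decode('utf-8') + b.decode('utf-8')
assert a in (a + b)   # substring search also works directly on the bytes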
But having said all this, I know that using UTF-8 internally for strings is quite common (e.g. Haskell does it, without even an index cache, and documents that indexing operations can be slow).
I find that Haskell is generally a bit slow with strings, utf-8 or otherwise. Haskell naturally lends itself to stream type processing so it's understandable that they would see efficient indexing as less important. Using linked-lists for strings is probably never going to be as efficient as well-implemented atomic string types.
CPython's FSR has received much (in my opinion, entirely uninformed) criticism from one vocal person in particular for not using UTF-8 internally.
I feel like you're putting a jinx on this list. Please don't encourage the "vocal person" to come here!

Oscar
Hi all,

Thanks everybody for your comments on this topic. Our initial motivation for doing that is to simplify RPython by getting rid of the RPython unicode type.

I think that the outcome of these mails is that there is no single obvious answer as to whether the change would benefit or hurt Python applications. The benefits in reduced memory usage might in some applications win or lose to the higher cost of indexing. I personally guess that most applications would benefit marginally, with a few special cases that lose significantly. This would not be a very nice outcome... The design of the index cache is crucial to minimize that... On the other hand, maybe some applications doing I/O on utf-8 strings might benefit a lot.

Once we have a good design, the only thing we can do is try it, and measure the results on a range of applications.

A bientôt,

Armin.
On Wed, Jan 22, 2014 at 08:01:31AM +0100, Johan Råde wrote:
At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode class. He gave two versions of the design:
A: unicode with a UTF-8 implementation and a UTF-32 interface.
B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows and a UTF-32 interface on UNIX-like systems.
With a UTF-8 implementation, won't that mean that string indexing operations are O(N) rather than O(1)? E.g. how do you know which UTF-8 byte(s) to look at to get the character at index 42 without having to walk the string from the start?

Have you considered the Flexible String Representation from CPython 3.3? http://www.python.org/dev/peps/pep-0393/

Basically, if the largest code point in the string is U+00FF or below, it is implemented using one byte per character (essentially Latin-1); if the largest code point is U+FFFF or below, it is implemented using two bytes per character (essentially UCS-2); otherwise, it is implemented using four bytes per character (UCS-4 or UTF-32). There's more to the FSR, read the PEP for further detail.

--
Steven
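[Editor's note: a rough sketch of the width rule described above (my own illustration in Python, not CPython's actual C implementation):]

def fsr_char_width(s):
    # Width per character is chosen from the largest code point in the
    # string, following the PEP 393 rule described above.
    largest = max(map(ord, s)) if s else 0
    if largest <= 0xFF:
        return 1   # Latin-1 layout: one byte per character
    elif largest <= 0xFFFF:
        return 2   # UCS-2 layout: two bytes per character
    else:
        return 4   # UCS-4 layout: four bytes per character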
On 2014-01-22 08:01, Johan Råde wrote:
Next, would such a change break any existing Python 2 code on Windows? Yes it will. For instance the following code for counting characters in a string:
f = [0] * (1 << 16)
for c in s:
    f[ord(c)] += 1
I would like to qualify this statement. Getting rid of the UTF-16 interface for the unicode class in Python 2 on Windows will:

* not break any well-written code
* fix a lot of poorly written code.

(The above code snippet is not well-written code.)

--Johan
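[Editor's note: for comparison, one way to write the same frequency count without the 16-bit assumption (my sketch, not from the original mail) is to key a dict by code point; such code keeps working whichever interface is chosen:]

from collections import defaultdict

f = defaultdict(int)
for c in s:             # s is the unicode string from the snippet above
    f[ord(c)] += 1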
On Tue, Jan 21, 2014 at 11:01 PM, Johan Råde <johan.rade@gmail.com> wrote:
At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode class. He gave two versions of the design:
Why spend brain cycles on a PyPy unicode class, when you could just move on to PyPy3? The majority of the Python community is on board with the 2-to-3 transition.
participants (7)
- Armin Rigo
- Dan Stromberg
- Elefterios Stamatogiannakis
- Johan Råde
- Nathan Hurst
- Oscar Benjamin
- Steven D'Aprano