flaming vs accuracy [was Re: Performance of int/long in Python 3]
Neil Hodgson
nhodgson at iinet.net.au
Thu Mar 28 23:34:27 EDT 2013
Steven D'Aprano:
> Some string operations need to inspect every character, e.g. str.upper().
> Even for them, the increased complexity of a variable-width encoding
> costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
> 4 bytes per character. You have to walk the string grabbing 1 byte at a
> time, and then decide whether you need another 1, 2 or 3 bytes. Even
> though it's still O(N), the added bit-masking and overhead of variable-
> width encoding adds to the overall cost.
It does add implementation complexity, but should only add a small
amount of time.
To compare costs, I am using the text of the web site
http://www.mofa.go.jp/mofaj/ since it has a reasonable amount (10%) of
multi-byte characters. Since the document fits in the BMP, Python
would choose a 2-byte-wide implementation, so I am emulating that choice
with a very simple 16-bit table-based upper-caser. Real Unicode case
conversion code is more concerned with edge cases such as the Turkic and
Lithuanian locales and Greek combining characters, and with allowing for
measurement/reallocation in the cases where the result is smaller or
larger. See, for example, glib's real_toupper in
https://git.gnome.org/browse/glib/tree/glib/guniprop.c
Here is some simplified example code that implements upper-casing
over 16-bit wide (utf16_up) and UTF-8 (utf8_up) buffers:
http://www.scintilla.org/UTF8Up.cxx
Since I didn't want to spend too much time writing code, it only
handles the BMP and doesn't have upper-case table entries outside ASCII
for now. If this were to be worked on further and made maintainable,
most of the masking and so forth would go into macros similar to
UTF8_COMPUTE/UTF8_GET in glib.
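To give a feel for the shape of the UTF-8 loop (a minimal sketch only,
not the UpUTF.cxx code itself, assuming valid UTF-8 input and, like the
test above, only upper-casing ASCII):

// Sketch: upper-case ASCII letters in a UTF-8 buffer in place.
// Multi-byte sequences are left untouched; their length is derived from
// the lead byte, which is the extra decision Steven refers to.
#include <stddef.h>

static size_t UTF8BytesOfLead(unsigned char lead) {
    if (lead < 0x80) return 1;      // ASCII
    if (lead < 0xE0) return 2;      // 110xxxxx
    if (lead < 0xF0) return 3;      // 1110xxxx
    return 4;                       // 11110xxx
}

void utf8_up_ascii(char *text, size_t len) {
    size_t i = 0;
    while (i < len) {
        const unsigned char lead = (unsigned char)text[i];
        if (lead >= 'a' && lead <= 'z')
            text[i] = (char)(lead - ('a' - 'A'));
        i += UTF8BytesOfLead(lead); // step over any continuation bytes
    }
}

The per-character cost over the fixed-width version is one table-free
length computation on the lead byte, which is why the measured difference
below is small.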
The UTF-8 case ranges from around 5% slower on average in a 32-bit
release build (VC2012 on an i7 870) to averaging a little faster in a
64-bit build. Both run at around a billion characters per second.
C:\u\hg\UpUTF\UpUTF>..\x64\Release\UpUTF.exe
Time taken for UTF8 of 80449=0.006528
Time taken for UTF16 of 71525=0.006610
Relative time taken UTF8/UTF16 0.987581
> Any string method that takes a starting offset requires the method to
> walk the string byte-by-byte. I've even seen languages put responsibility
> for dealing with that onto the programmer: the "start offset" is given in
> *bytes*, not characters. I don't remember what language this was... it
> might have been Haskell? Whatever it was, it horrified me.
It doesn't horrify me - I've been working this way for over 10 years
and it seems completely natural. You can wrap access in iterators that
hide the byte offsets if you like. This then ensures that all operations
on those iterators are safe, since the iterator can only ever point at
the start or end of a valid character.
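For illustration (a minimal sketch, not taken from any particular
codebase, assuming the buffer holds valid UTF-8), such an iterator just
keeps a private byte offset and advances by whole characters:

// Forward iterator over a UTF-8 buffer that exposes code points while
// keeping the byte offset internal; illustrative names, not a real API.
#include <stddef.h>
#include <string>

class UTF8Iterator {
    const std::string &text;
    size_t pos;         // byte offset, always at a character boundary
public:
    UTF8Iterator(const std::string &text_, size_t pos_ = 0) :
        text(text_), pos(pos_) {}
    bool AtEnd() const { return pos >= text.length(); }
    void Next() {       // advance by one whole character
        const unsigned char lead = (unsigned char)text[pos];
        pos += (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 :
               (lead < 0xF0) ? 3 : 4;
    }
    unsigned int CodePoint() const {    // decode the character at pos
        const unsigned char lead = (unsigned char)text[pos];
        if (lead < 0x80)
            return lead;
        if (lead < 0xE0)
            return ((lead & 0x1F) << 6) |
                   ((unsigned char)text[pos+1] & 0x3F);
        if (lead < 0xF0)
            return ((lead & 0x0F) << 12) |
                   (((unsigned char)text[pos+1] & 0x3F) << 6) |
                   ((unsigned char)text[pos+2] & 0x3F);
        return ((lead & 0x07) << 18) |
               (((unsigned char)text[pos+1] & 0x3F) << 12) |
               (((unsigned char)text[pos+2] & 0x3F) << 6) |
               ((unsigned char)text[pos+3] & 0x3F);
    }
};

Code written against Next()/CodePoint() can never land in the middle of
a multi-byte sequence, which is the safety property mentioned above.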
> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.
I think you mean writing a lot of Latin-1 characters outside ASCII.
However, even people writing texts in, say, French will find that only a
small proportion of their text is outside ASCII and so the cost of UTF-8
is correspondingly small.
The counter-problem is that a French document that needs to include
one mathematical symbol outside Latin-1 will double in size as a Python
string, and one emoji outside the BMP will quadruple it.
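To see where those factors come from (a sketch of PEP 393-style width
selection, not CPython's actual code), the fixed-width size is driven by
the single widest code point, while UTF-8 pays per character:

// Compare memory for a string under a PEP 393-style fixed-width
// representation (1, 2 or 4 bytes per character, chosen by the widest
// code point present) against UTF-8. Illustrative only.
#include <stddef.h>
#include <vector>

size_t FixedWidthBytes(const std::vector<unsigned int> &codePoints) {
    size_t width = 1;
    for (unsigned int cp : codePoints) {
        if (cp >= 0x10000) width = 4;
        else if (cp >= 0x100 && width < 2) width = 2;
    }
    return codePoints.size() * width;
}

size_t UTF8Bytes(const std::vector<unsigned int> &codePoints) {
    size_t bytes = 0;
    for (unsigned int cp : codePoints) {
        bytes += (cp < 0x80) ? 1 : (cp < 0x800) ? 2 :
                 (cp < 0x10000) ? 3 : 4;
    }
    return bytes;
}

A mostly-ASCII French text with a single astral-plane character costs
close to 4 bytes per character in the fixed-width form but stays near
1 byte per character in UTF-8.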
Neil