flaming vs accuracy [was Re: Performance of int/long in Python 3]
nhodgson at iinet.net.au
Fri Mar 29 04:34:27 CET 2013
> Some string operations need to inspect every character, e.g. str.upper().
> Even for them, the increased complexity of a variable-width encoding
> costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
> 4 bytes per character. You have to walk the string grabbing 1 byte at a
> time, and then decide whether you need another 1, 2 or 3 bytes. Even
> though it's still O(N), the added bit-masking and overhead of variable-
> width encoding adds to the overall cost.
It does add implementation complexity, but the runtime cost should be
small.
To compare costs, I am using the text of the web site
http://www.mofa.go.jp/mofaj/ since it has a reasonable amount (10%) of
multi-byte characters. Since the document fits in the BMP, Python
would choose a 2-byte-wide implementation, so I am emulating that choice
with a very simple 16-bit table-based upper-caser. Real Unicode case
conversion code is more concerned with edge cases, like the Turkic and
Lithuanian locales and Greek combining characters, and must also allow
for measurement/reallocation in the cases where the result is
smaller or larger. See, for example, glib's real_toupper.
Here is some simplified example code that implements upper-casing
over 16-bit wide (utf16_up) and UTF-8 (utf8_up) buffers:
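The original listing is not reproduced here; the following is a sketch
consistent with the description above (BMP-only, case table restricted to
ASCII; the function names utf16_up and utf8_up come from the post, the
bodies are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Upper-case one code point. A real implementation would consult a
   full Unicode case table; this sketch only handles ASCII. */
static uint32_t up(uint32_t ch) {
    return (ch >= 'a' && ch <= 'z') ? ch - ('a' - 'A') : ch;
}

/* Upper-case a buffer of 16-bit code units in place.
   BMP-only, so each character is exactly one unit. */
void utf16_up(uint16_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint16_t)up(buf[i]);
}

/* Upper-case a UTF-8 buffer in place. Because the case table only
   changes ASCII characters, multi-byte sequences can be stepped over
   whole; the lead byte says how many bytes the sequence occupies. */
void utf8_up(unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        if (b < 0x80) {
            buf[i] = (unsigned char)up(b);   /* 1-byte: ASCII */
            i += 1;
        } else if (b < 0xE0) {
            i += 2;                          /* 2-byte sequence */
        } else if (b < 0xF0) {
            i += 3;                          /* 3-byte (rest of BMP) */
        } else {
            i += 4;                          /* 4-byte (outside BMP) */
        }
    }
}
```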
    Since I didn't want to spend too much time writing code, it only
handles the BMP and, for now, has no upper-case table entries outside
ASCII. If this were going to be worked on further and made
maintainable, most of the masking and so forth would go into macros
similar to UTF8_COMPUTE/UTF8_GET in glib.
    The UTF-8 case ranges from around 5% slower on average in a 32-bit
release build (VC2012 on an i7 870) to averaging a little faster in a
64-bit build. Both are around a billion characters per second.
Time taken for UTF8 of 80449=0.006528
Time taken for UTF16 of 71525=0.006610
Relative time taken UTF8/UTF16 0.987581
> Any string method that takes a starting offset requires the method to
> walk the string byte-by-byte. I've even seen languages put responsibility
> for dealing with that onto the programmer: the "start offset" is given in
> *bytes*, not characters. I don't remember what language this was... it
> might have been Haskell? Whatever it was, it horrified me.
It doesn't horrify me - I've been working this way for over 10 years
and it seems completely natural. You can wrap access in iterators that
hide the byte offsets if you like. This ensures that all operations
on those iterators are safe, since an iterator can only ever point at
the start or end of a valid character.
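As a sketch of that idea (the type and helper names here are hypothetical,
and well-formed UTF-8 is assumed): an iterator that always advances by
whole characters, so its byte offset can never land mid-sequence:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const unsigned char *buf;
    size_t len;
    size_t pos;   /* invariant: always a character boundary */
} U8Iter;

/* Length in bytes of the sequence starting with lead byte b
   (assumes well-formed UTF-8). */
static size_t u8_seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* Decode the character at the iterator and advance to the next
   boundary; returns the code point, or -1 at end of buffer. */
static int32_t u8_next(U8Iter *it) {
    static const uint32_t lead_mask[4] = {0x7F, 0x1F, 0x0F, 0x07};
    if (it->pos >= it->len)
        return -1;
    unsigned char b = it->buf[it->pos];
    size_t n = u8_seq_len(b);
    uint32_t cp = b & lead_mask[n - 1];
    for (size_t i = 1; i < n; i++)          /* fold in continuation bytes */
        cp = (cp << 6) | (it->buf[it->pos + i] & 0x3F);
    it->pos += n;                           /* lands on the next boundary */
    return (int32_t)cp;
}
```

Callers see a sequence of code points and never touch the underlying byte
offsets directly.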
> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.
I think you mean writing a lot of Latin-1 characters outside ASCII.
However, even people writing texts in, say, French will find that only a
small proportion of their text is outside ASCII and so the cost of UTF-8
is correspondingly small.
The counter-problem is that a French document that needs to include
one mathematical symbol (or emoji) outside Latin-1 will double in size
as a Python string.