2013/2/11 Eleytherios Stamatogiannakis <estama@gmail.com>
On 11/02/13 18:13, Amaury Forgeot d'Arc wrote:
2013/2/11 Eleytherios Stamatogiannakis <estama@gmail.com>
Right now we are using PyPy's "codecs.utf_8_encode" and "codecs.utf_8_decode" to do this conversion.
It's the most direct way to use the utf-8 conversion functions.
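For reference, a minimal sketch of how these two functions are typically called (the strings and variable names are only illustrative; both return a (result, length) pair):

    import codecs

    text = u"hello world"                      # unicode value headed for sqlite3
    raw, consumed = codecs.utf_8_encode(text)  # -> (UTF-8 bytes, number of characters consumed)
    back, consumed = codecs.utf_8_decode(raw)  # -> (unicode, number of bytes consumed)
    assert back == text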
Is there a faster way to do these conversions (encoding, decoding) in PyPy? Does CPython do something more clever than PyPy, like storing unicode strings whose content is all ASCII in an ASCII representation?
Over the years, utf-8 conversions have been heavily optimized in CPython: allocate short buffers on the stack, use aligned reads, quick check for ascii-only content (data & 0x80808080)... all things that PyPy does not do.
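A rough Python sketch of the ascii-only fast-path idea (this only illustrates the concept; CPython performs the check in C, a machine word at a time, which is what the 0x80808080 mask refers to):

    def decode_utf8(data):
        # If no byte has the high bit set, the buffer is pure ASCII and
        # decoding needs no multibyte handling or reallocation.
        if all(b < 0x80 for b in bytearray(data)):
            return data.decode('ascii')   # cheap path
        return data.decode('utf-8')       # general path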
But I tried some "timeit" runs, and PyPy is often faster than CPython, and never much slower.
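The kind of micro-benchmark meant here looks roughly like this (the payload and repeat counts are made up; run the same script under both CPython and PyPy):

    import timeit

    setup = "s = u'some mostly ascii text ' * 1000"
    print(timeit.timeit("s.encode('utf-8')", setup=setup, number=1000))
    print(timeit.timeit("s.encode('utf-8').decode('utf-8')", setup=setup, number=1000))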
This is odd. Maybe APSW uses some other CPython conversion API, because the conversion overhead is not visible in CPython + APSW profiles.
Which kind of profiler are you using? It's possible that CPython builtin functions are not profiled the same way as PyPy's.
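One way to see how the builtin conversion shows up (or doesn't) in a profile is to run it in isolation under cProfile; this is only a sketch, with a made-up payload instead of the real APSW workload:

    import cProfile
    import codecs

    data = (u"some text " * 100000).encode('utf-8')

    def convert():
        for _ in range(100):
            codecs.utf_8_decode(data)

    cProfile.run("convert()", sort="tottime")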
Do your strings have many non-ascii characters?
What's the len(utf8)/len(unicode) ratio?
Our current tests are using plain ASCII input (imported into sqlite3) which:
- Go from sqlite3 (UTF-8) -> PyPy (unicode)
- PyPy (unicode) -> sqlite3 (UTF-8).
So I guess the len(utf-8)/len(unicode) = 1/4 (assuming 1 byte per char for ASCII (UTF-8) and 4 bytes per char for PyPy's unicode storage).
No, my question was about the number of non-ascii characters:

    s = u"SomeUnicodeString"
    1.0 * len(s.encode('utf8')) / len(s)

PyPy allocates the StringBuffer upfront, and must realloc to cope with multibyte characters. For English text the ratio is 1.0; for Greek it will be close to 2.0.

-- Amaury Forgeot d'Arc
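A worked example of the ratio described above (the sample strings are made up; Greek letters take two bytes each in UTF-8, hence a ratio of 2.0):

    english = u"hello"
    greek = u"\u03b3\u03b5\u03b9\u03b1"   # four Greek letters
    print(1.0 * len(english.encode('utf8')) / len(english))  # 1.0
    print(1.0 * len(greek.encode('utf8')) / len(greek))      # 2.0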