[pypy-dev] Unicode encode/decode speed

Eleytherios Stamatogiannakis estama at gmail.com
Mon Feb 11 18:02:04 CET 2013


On 11/02/13 18:13, Amaury Forgeot d'Arc wrote:
>
> 2013/2/11 Eleytherios Stamatogiannakis <estama at gmail.com
> <mailto:estama at gmail.com>>
>
>     Right now we are using PyPy's "codecs.utf_8_encode" and
>     "codecs.utf_8_decode" to do this conversion.
>
>
> It's the most direct way to use the utf-8 conversion functions.
>
>     Is there a faster way to do these conversions (encoding, decoding)
>     in PyPy? Does CPython do something more clever than PyPy, like
>     storing unicodes with full ASCII char content in an ASCII
>     representation?
>
>
> Over the years, utf-8 conversions have been heavily optimized in CPython:
> allocating short buffers on the stack, using aligned reads, a quick check
> for ascii-only content (data & 0x80808080)...
> All things that pypy does not do.
>
> But I tried some "timeit" runs, and pypy is often faster than CPython,
> and never much slower.

This is odd. Maybe APSW uses some other CPython conversion API? Because 
the conversion overhead is not visible in CPython + APSW profiles.
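For reference, a minimal sketch of the kind of "timeit" comparison mentioned above, using the same `codecs.utf_8_encode`/`codecs.utf_8_decode` calls we use (the payload size and repeat count are arbitrary choices, not the actual benchmark):

```python
import codecs
import timeit

# ASCII-only payload, similar in spirit to the plain-ASCII test data
# described below; 10000 chars is an arbitrary choice.
text = u"a" * 10000
data = text.encode("utf-8")

# codecs.utf_8_encode / utf_8_decode each return a (result, length) tuple.
encoded, _ = codecs.utf_8_encode(text)
decoded, _ = codecs.utf_8_decode(data)
assert encoded == data and decoded == text

# Rough per-direction timings, comparable across CPython and PyPy.
enc_t = timeit.timeit(lambda: codecs.utf_8_encode(text), number=10000)
dec_t = timeit.timeit(lambda: codecs.utf_8_decode(data), number=10000)
print("encode: %.4fs  decode: %.4fs" % (enc_t, dec_t))
```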

> Do your strings have many non-ascii characters?
> what's the len(utf8)/len(unicode) ratio?
>

Our current tests are using plain ASCII input (imported into sqlite3) 
which:

- Goes from sqlite3 (UTF-8) -> PyPy (unicode)
- Goes from PyPy (unicode) -> sqlite3 (UTF-8).

So I guess len(utf-8)/len(unicode) = 1/4
(assuming 1 byte per char for ASCII (UTF-8) and 4 bytes per char for 
PyPy's unicode storage).
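That guess mixes two different ratios, which a short snippet makes concrete (a sketch; the in-memory width per character depends on the interpreter build):

```python
import sys

s = u"hello"                 # plain ASCII content
b = s.encode("utf-8")

# At the byte-stream level, ASCII text is one UTF-8 byte per character,
# so the encoded length equals the character count (ratio 1).
assert len(b) == len(s)

# The 1/4 figure above is about in-memory storage instead: a
# wide-unicode build can spend 4 bytes per code point internally.
# sys.getsizeof shows the (build-dependent) object size.
print(len(b), len(s), sys.getsizeof(s))
```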

l.
