[pypy-dev] Unicode encode/decode speed
Eleytherios Stamatogiannakis
estama at gmail.com
Mon Feb 11 18:02:04 CET 2013
On 11/02/13 18:13, Amaury Forgeot d'Arc wrote:
>
> 2013/2/11 Eleytherios Stamatogiannakis <estama at gmail.com>
>
> Right now we are using PyPy's "codecs.utf_8_encode" and
> "codecs.utf_8_decode" to do this conversion.
>
>
> It's the most direct way to use the utf-8 conversion functions.
>
> Is there a faster way to do these conversions (encoding, decoding)
> in PyPy? Does CPython do something more clever than PyPy, like
> storing unicode strings whose content is pure ASCII in an ASCII
> representation?
>
>
> Over the years, utf-8 conversions have been heavily optimized in CPython:
> allocate short buffers on the stack, use aligned reads, quick check for
> ascii-only content (data & 0x80808080)...
> All things that pypy does not do.
>
> But I tried some "timeit" runs, and pypy is often faster than CPython,
> and never much slower.
This is odd. Maybe APSW uses some other CPython conversion API? I ask
because the conversion overhead is not visible in CPython + APSW profiles.
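
To make the comparison reproducible, here is a minimal timeit sketch that
could be run under both CPython and PyPy. It only compares the
"codecs.utf_8_encode" / "codecs.utf_8_decode" calls mentioned above with the
ordinary encode/decode methods. The sample text, the function names and the
repeat count are hypothetical choices for illustration, not taken from our
real workload, and it does not measure whatever API APSW calls internally.

```python
import codecs
import timeit

# Hypothetical test input: plain ASCII text, as in our current tests.
text = u"plain ascii sample " * 1000
data = text.encode("utf-8")


def encode_direct():
    # codecs.utf_8_encode returns (bytes, number_of_chars_consumed)
    return codecs.utf_8_encode(text)[0]


def encode_method():
    return text.encode("utf-8")


def decode_direct():
    # codecs.utf_8_decode returns (unicode, number_of_bytes_consumed)
    return codecs.utf_8_decode(data)[0]


def decode_method():
    return data.decode("utf-8")


def bench(number=1000):
    """Time each conversion function; returns {function name: seconds}."""
    results = {}
    for fn in (encode_direct, encode_method, decode_direct, decode_method):
        results[fn.__name__] = timeit.timeit(fn, number=number)
    return results


if __name__ == "__main__":
    for name, seconds in sorted(bench().items()):
        print("%-14s %.4f s" % (name, seconds))
```

Running the same script with both interpreters should show whether the
conversion calls themselves differ much, or whether the overhead comes from
somewhere else in the sqlite3 path.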
> Do your strings have many non-ascii characters?
> What's the len(utf8)/len(unicode) ratio?
>
Our current tests use plain ASCII input (imported into sqlite3), which goes:
- from sqlite3 (UTF-8) -> PyPy (unicode)
- from PyPy (unicode) -> sqlite3 (UTF-8).
So I guess len(utf8)/len(unicode) = 1/4 when measured in bytes
(assuming 1 byte per char for ASCII in UTF-8 and 4 bytes per char for
PyPy's unicode storage).
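
As a rough illustration of the two ways to read that ratio, the sketch below
computes it both in lengths (bytes per character) and in storage bytes. The
helper name "utf8_ratios" is made up for this example, and the default of
4 bytes per character is only the assumption stated above about PyPy's
unicode storage, not something measured.

```python
import codecs


def utf8_ratios(text, bytes_per_char=4):
    """Return (length ratio, storage ratio) for a non-empty unicode string.

    length ratio  = UTF-8 bytes / number of characters
    storage ratio = UTF-8 bytes / (characters * bytes_per_char)

    bytes_per_char=4 is an assumption about the in-memory unicode storage.
    """
    utf8, _ = codecs.utf_8_encode(text)
    length_ratio = len(utf8) / float(len(text))
    storage_ratio = len(utf8) / float(len(text) * bytes_per_char)
    return length_ratio, storage_ratio


sample = u"hello sqlite3"           # pure ASCII, like our test data
print(utf8_ratios(sample))          # (1.0, 0.25)
```

For pure ASCII input the length ratio is 1 and only the storage ratio is 1/4;
non-ASCII characters would push both numbers up.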
l.