[Python-Dev] PEP 393 review

"Martin v. Löwis" martin at v.loewis.de
Mon Aug 29 21:34:48 CEST 2011


>> Those haven't been ported to the new API, yet. Consider, for example,
>> d9821affc9ee. Before that, I got 253 MB/s on the 4096 units read test;
>> with that change, I get 610 MB/s. The trunk gives me 488 MB/s, so this
>> is a 25% speedup for PEP 393.
> 
> If I understand correctly, the performance now depends highly on the
> characters used? A pure ASCII string is faster than a string with
> characters in the ISO-8859-1 charset?

How did you infer that from the above paragraph??? ASCII and Latin-1 are
mostly identical in terms of performance - the ASCII decoder should be
slightly slower than the Latin-1 decoder, since the ASCII decoder needs
to check for errors, whereas the Latin-1 decoder will never be
confronted with errors (every byte value is a valid Latin-1 code point).
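To illustrate with the built-in Python codecs (a minimal sketch of the
same asymmetry, not the C-level decoders the PEP touches): Latin-1
accepts every possible byte, while ASCII must validate each one.

```python
# Every byte 0x00-0xFF maps directly to a Latin-1 code point,
# so this decode can never fail:
data = bytes(range(256))
text = data.decode("latin-1")
assert len(text) == 256

# The ASCII decoder must check each byte; any byte >= 0x80 is an error:
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print("invalid byte at position", e.start)  # position 128
```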

What matters is:
a) has the codec already been rewritten to use the new representation,
   or must it go through Py_UNICODE[] first, requiring then a second
   copy to the canonical form?
b) what is the cost of finding out the highest character, regardless
   of what that highest character turns out to be?
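The scan in (b) can be sketched in Python (an illustration of the
PEP 393 width rule only; the actual scan happens in C inside the
decoders, and `pep393_width` is a hypothetical helper name):

```python
def pep393_width(s):
    """Per-character storage width (in bytes) that PEP 393 would pick.

    The whole string has to be scanned first to find the highest
    code point - that scan is the cost referred to in (b)."""
    maxchar = max(map(ord, s), default=0)
    if maxchar < 0x100:       # Latin-1 range -> 1 byte per char
        return 1
    if maxchar < 0x10000:     # other BMP -> 2 bytes per char
        return 2
    return 4                  # non-BMP -> 4 bytes per char

print(pep393_width("hello"))   # 1
print(pep393_width("naïve"))   # 1 (ï is U+00EF, still Latin-1)
print(pep393_width("€"))       # 2 (U+20AC, BMP)
print(pep393_width("𝄞"))      # 4 (U+1D11E, non-BMP)
```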

> Is it also true for BMP characters vs non-BMP
> characters?

Well... if you are talking about the ASCII and Latin-1 codecs - neither
of these supports most BMP characters, let alone non-BMP characters.
In general, non-BMP characters are more expensive to process, since
they take more space.

> Do these benchmark tools use only ASCII characters, or also some
> ISO-8859-1 characters?

See for yourself. iobench uses Latin-1, including non-ASCII, but not
non-Latin-1.

> Or, better, different Unicode ranges in different tests?

That's why I asked for a list of benchmarks to perform. I cannot
run an infinite number of benchmarks prior to adoption of the PEP.

Regards,
Martin
