[Python-Dev] PEP 393 decode() oddity
Serhiy Storchaka
storchaka at gmail.com
Tue Mar 27 00:04:05 CEST 2012
26.03.12 01:28, Victor Stinner wrote:
> Cool, Python 3.3 is *much* faster to decode pure ASCII :-)
It is even faster on large data. 1000 characters is not enough to
completely amortize the constant cost of the function calls. Python
3.3 is really cool.
>> encoding  string      2.7   3.2   3.3
>>
>> ascii     " " * 1000  5.4   5.3   1.2
>
> 4.5x faster than Python 2 here.
And it can be made faster still (issue #14419).
>> utf-8     " " * 1000  6.7   2.4   2.1
>
> 3.2x faster
In theory, the speed should match the latin1 speed, and it does in the
limit, for large data. For medium-sized data the startup overhead is
visible and utf-8 is a bit slower than it could be.
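
To check the startup overhead locally, here is a minimal sketch. The helper
name cost_per_byte and the chosen sizes are my own assumptions, not part of
the benchmark quoted above. It reports the decoding cost per byte, which
should shrink as the fixed per-call cost is spread over more data.

```python
import timeit


def cost_per_byte(encoding, size, number=50, repeat=3):
    """Return the best observed decode time per input byte, in seconds.

    Hypothetical helper for illustration only.
    """
    data = b" " * size  # pure ASCII input, valid in all three encodings
    timer = timeit.Timer(lambda: data.decode(encoding))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / (number * size)


if __name__ == "__main__":
    for size in (1000, 100000, 1000000):
        for encoding in ("ascii", "latin1", "utf-8"):
            ns = cost_per_byte(encoding, size) * 1e9
            print("%-7s %8d bytes  %.4f ns/byte" % (encoding, size, ns))
```

The numbers depend on the machine and the Python build, so only the trend
across sizes is meaningful.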
> It's cool because in practice, a lot of strings are pure ASCII (as
> Martin showed in his Django benchmark).
But there is also a lot of non-ASCII text. And with mostly-ASCII text
containing at least one non-ASCII character (for example, Martin's full
name), the utf-8 decoder copes much worse, and worse than in Python 3.2.
The decoder should be slower only by a small amount, related to scanning.
I believe there is room for optimization.
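
A minimal sketch for reproducing the mostly-ASCII case. The helper name
time_utf8_decode and the sample strings are assumptions chosen for
illustration; the single non-ASCII character is placed at the end so the
decoder meets it only after a long ASCII run.

```python
import timeit


def time_utf8_decode(text, number=1000, repeat=3):
    """Return the best total time for `number` UTF-8 decodes of `text`.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8")
    assert data.decode("utf-8") == text  # sanity check: round trip works
    timer = timeit.Timer(lambda: data.decode("utf-8"))
    return min(timer.repeat(repeat=repeat, number=number))


# 1000 characters each; the second one differs by a single character.
pure_ascii = "a" * 1000
mostly_ascii = "a" * 999 + "\u00f6"

if __name__ == "__main__":
    print("pure ASCII  :", time_utf8_decode(pure_ascii))
    print("mostly ASCII:", time_utf8_decode(mostly_ascii))
```

Running the same script under 3.2 and 3.3 shows how much of the difference
comes from the single non-ASCII character.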
> I'm interested in any patch optimizing any Python codecs. I'm not
> working on optimizing Python Unicode anymore, various benchmarks
> showed me that Python 3.3 is as good or faster than Python 3.2. That's
> enough for me.
Then would you accept the patch I proposed in issue 14249? This patch
will not make up all of the lost ground, but it is very simple and should
not raise objections. The optimization I am developing now accelerates the
decoder even more, but so far it is too ugly spaghetti code.
> When Python 3.3 is slower than Python 3.2, it's because Python 3.3
> must compute the maximum character of the result, and I fail to see
> how to optimize this requirement.
A significant slowdown was caused by the use of PyUnicode_WRITE with a
variable kind in a loop. In some cases, it would be useful to expand the
loop into a cascade of independent loops which fall back onto each other
(as you have already done in utf8_scanner).
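
To illustrate the idea, here is a rough Python model of such a cascade. It
is not the C code from the decoder: the function name store_code_points is
hypothetical, and array typecodes stand in for the 1-, 2- and 4-byte kinds.
Each loop writes with one fixed width and hands over to the next, wider
loop only when it meets a code point that does not fit.

```python
from array import array


def store_code_points(code_points):
    """Store code points in the narrowest array that can hold them all.

    Illustrative model only: "B", "H" and "L" stand in for the 1-byte,
    2-byte and 4-byte kinds.
    """
    cps = list(code_points)
    n = len(cps)
    i = 0

    # Loop 1: fixed 1-byte storage, no per-item check of the kind.
    out = array("B")
    while i < n and cps[i] <= 0xFF:
        out.append(cps[i])
        i += 1
    if i == n:
        return out

    # Loop 2: widen once, then continue with fixed 2-byte storage.
    out = array("H", out.tolist())
    while i < n and cps[i] <= 0xFFFF:
        out.append(cps[i])
        i += 1
    if i == n:
        return out

    # Loop 3: widen once more; every remaining code point fits.
    out = array("L", out.tolist())
    while i < n:
        out.append(cps[i])
        i += 1
    return out
```

The point of the structure is that the width is constant inside each loop,
so the choice of kind is made at most twice per string rather than once
per character.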