[Python-Dev] PEP 393 decode() oddity

Victor Stinner victor.stinner at gmail.com
Mon Mar 26 00:28:33 CEST 2012


Cool, Python 3.3 is *much* faster to decode pure ASCII :-)

> encoding  string                 2.7   3.2   3.3
>
> ascii     " " * 1000             5.4   5.3   1.2

4.5x faster than Python 2 here.

> utf-8     " " * 1000             6.7   2.4   2.1

3.2x faster

It's cool because in practice, a lot of strings are pure ASCII (as
Martin showed in his Django benchmark).
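
Timings like those above can be reproduced with a simple timeit loop; this is only a sketch of a plausible harness (the actual benchmark script isn't shown in the thread):

```python
import timeit

# Hypothetical reproduction of the benchmark: time how long it takes
# to decode a 1000-character pure-ASCII byte string with each codec.
data = (" " * 1000).encode("ascii")

for codec in ("ascii", "utf-8", "latin-1"):
    # 100000 repetitions; report microseconds per decode() call
    t = timeit.timeit(lambda: data.decode(codec), number=100000)
    print("%-8s %.3f us/call" % (codec, t / 100000 * 1e6))
```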


> latin1    " " * 1000             1.8   1.7   1.3
> latin1    "\u0080" * 1000        1.7   1.6   1.0
> ...
> The first oddity is that the characters from the second half of the Latin1
> table decode faster than the characters from the first half.

The Latin1 decoder of Python 3.3 is *faster* than the decoders of
Python 2.7 and 3.2 according to your benchmark, so I don't see any
issue here :-) Martin explained why it is slower for pure ASCII.

> I think that the characters from the first half of the table
> should decode just as quickly.

The Latin1 decoder is already heavily optimized; I don't see how to
make it faster.
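
For context (my illustration, not from the thread): Latin-1 decoding is a direct byte-to-code-point mapping, with every byte value 0-255 mapping to the code point of the same ordinal, so the decoder is essentially a copy loop with nothing left to remove:

```python
# Latin-1 maps each byte value directly to the code point with the
# same ordinal, so decoding all 256 byte values is a straight 1:1 copy.
data = bytes(range(256))
text = data.decode("latin-1")

assert len(text) == 256
assert all(ord(c) == b for c, b in zip(text, data))

# The round-trip is lossless for any byte string.
assert text.encode("latin-1") == data
```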

> The second sad oddity is that UTF-16 decoding in 3.3 is much slower than
> even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a
> considerable regression. UTF-32 decoding has also slowed down by 1.5-2x.

Only the ASCII, latin1 and UTF-8 decoders are heavily optimized. We
can do better for UTF-16 and UTF-32.

I'm just less motivated because UTF-16/32 are less common than
ASCII/latin1/UTF-8.

> How serious a problem is this for the Python 3.3 release? I could do the
> optimization, if no one is working on this already.

I'm interested in any patch optimizing any Python codec. I'm not
working on optimizing Python Unicode anymore; various benchmarks
showed me that Python 3.3 is as fast as or faster than Python 3.2.
That's enough for me.

When Python 3.3 is slower than Python 3.2, it's because Python 3.3
must compute the maximum character of the result, and I fail to see
how to optimize away this requirement. I have already introduced many
fast paths where possible, such as creating a substring of an ASCII
string (the result is ASCII, so there is no need to scan the substring).
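
The max-character requirement comes from PEP 393's flexible string representation. As an illustrative sketch (my example, not from the thread), sys.getsizeof shows the per-character width chosen from the highest code point in the string:

```python
import sys

# Under PEP 393 (Python 3.3+), a str stores code points in 1, 2, or 4
# bytes per character, chosen from the maximum code point it contains.
ascii_s = "a" * 1000     # max code point < 128   -> 1 byte per char
bmp_s = "\u0100" * 1000  # max code point < 65536 -> 2 bytes per char

# The 2-byte-per-char string takes roughly twice the memory.
print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s))

# The fast path mentioned above: a slice of an ASCII string cannot
# contain a higher code point, so the result is known to be ASCII
# without rescanning its characters.
sub = ascii_s[10:20]
assert sys.getsizeof(sub) < sys.getsizeof("\u0100" * 10)
```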

That doesn't mean it is no longer possible to optimize Python Unicode ;-)

Victor
