[Python-Dev] PEP 393 decode() oddity
Serhiy Storchaka
storchaka at gmail.com
Sun Mar 25 18:25:10 CEST 2012
PEP 393 (Flexible String Representation) is, without doubt, one of the
pearls of Python 3.3. In addition to reducing memory consumption, it
often leads to a corresponding increase in speed as well. In particular,
string encoding is now 1.5-3 times faster.
But decoding is not doing so well. Here are the results of measuring the
performance of decoding a 1000-character string consisting of characters
from different ranges of Unicode, for three versions of Python --
2.7.3rc2, 3.2.3rc2+ and 3.3.0a1+. Little-endian 32-bit i686 builds,
gcc 4.4.
encoding string 2.7 3.2 3.3
ascii " " * 1000 5.4 5.3 1.2
latin1 " " * 1000 1.8 1.7 1.3
latin1 "\u0080" * 1000 1.7 1.6 1.0
utf-8 " " * 1000 6.7 2.4 2.1
utf-8 "\u0080" * 1000 12.2 11.0 13.0
utf-8 "\u0100" * 1000 12.2 11.1 13.6
utf-8 "\u0800" * 1000 14.7 14.4 17.2
utf-8 "\u8000" * 1000 13.9 13.3 17.1
utf-8 "\U00010000" * 1000 17.3 17.5 21.5
utf-16le " " * 1000 5.5 2.9 6.5
utf-16le "\u0080" * 1000 5.5 2.9 7.4
utf-16le "\u0100" * 1000 5.5 2.9 8.9
utf-16le "\u0800" * 1000 5.5 2.9 8.9
utf-16le "\u8000" * 1000 5.5 7.5 21.3
utf-16le "\U00010000" * 1000 9.6 12.9 30.1
utf-16be " " * 1000 5.5 3.0 9.0
utf-16be "\u0080" * 1000 5.5 3.1 9.8
utf-16be "\u0100" * 1000 5.5 3.1 10.4
utf-16be "\u0800" * 1000 5.5 3.1 10.4
utf-16be "\u8000" * 1000 5.5 6.6 21.2
utf-16be "\U00010000" * 1000 9.6 11.2 28.9
utf-32le " " * 1000 10.2 10.4 15.1
utf-32le "\u0080" * 1000 10.0 10.4 16.5
utf-32le "\u0100" * 1000 10.0 10.4 19.8
utf-32le "\u0800" * 1000 10.0 10.4 19.8
utf-32le "\u8000" * 1000 10.1 10.4 19.8
utf-32le "\U00010000" * 1000 11.7 11.3 20.2
utf-32be " " * 1000 10.0 11.2 15.0
utf-32be "\u0080" * 1000 10.1 11.2 16.4
utf-32be "\u0100" * 1000 10.0 11.2 19.7
utf-32be "\u0800" * 1000 10.1 11.2 19.7
utf-32be "\u8000" * 1000 10.1 11.2 19.7
utf-32be "\U00010000" * 1000 11.7 11.2 20.2
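A measurement of this kind can be sketched roughly as follows (a minimal
sketch with timeit; the attached bench_decode.py may differ in details
such as the repeat count and output format):

```python
import timeit

# Characters from different Unicode ranges, exercising the ASCII,
# UCS1, UCS2 and UCS4 representations introduced by PEP 393.
samples = [" ", "\u0080", "\u0100", "\u0800", "\u8000", "\U00010000"]
encodings = ["ascii", "latin1", "utf-8",
             "utf-16le", "utf-16be", "utf-32le", "utf-32be"]

for enc in encodings:
    for ch in samples:
        s = ch * 1000
        try:
            data = s.encode(enc)   # skip combinations the codec rejects
        except UnicodeEncodeError:
            continue
        n = 10000
        t = timeit.timeit(lambda: data.decode(enc), number=n)
        # report microseconds per decode of the 1000-character string
        print("%-9s %-16r %.1f" % (enc, ch, t * 1e6 / n))
```

The numbers in the table above would then be per-call decode times for
each encoding/character-range pair.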
The first oddity is that characters from the second half of the Latin1
table decode faster than characters from the first half. I think the
characters from the first half of the table should decode just as
quickly.
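This asymmetry can be checked directly with timeit (a minimal sketch;
the exact numbers of course depend on the build and platform):

```python
import timeit

low = (" " * 1000).encode("latin1")        # all code points < 0x80
high = ("\u0080" * 1000).encode("latin1")  # all code points >= 0x80

n = 100000
t_low = timeit.timeit(lambda: low.decode("latin1"), number=n)
t_high = timeit.timeit(lambda: high.decode("latin1"), number=n)
# On the 32-bit build above, the high-range string decodes faster.
print("low: %.2f us  high: %.2f us" % (t_low * 1e6 / n, t_high * 1e6 / n))
```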
The second, sadder oddity is that UTF-16 decoding in 3.3 is much slower
than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This
is a considerable regression. UTF-32 decoding has also slowed down by
1.5-2 times. That UTF-8 decoding has also slowed in some cases is not
surprising. I believe that on a platform with a 64-bit long there may
be other oddities.
How serious a problem is this for the Python 3.3 release? I could work
on the optimization, if someone is not already working on it.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench_decode.py
Type: text/x-python
Size: 806 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20120325/a599326c/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench_decode-2.py
Type: text/x-python
Size: 810 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20120325/a599326c/attachment-0001.py>