PEP 393 decode() oddity
PEP 393 (Flexible String Representation) is, without doubt, one of the pearls of Python 3.3. In addition to reducing memory consumption, it also often leads to a corresponding increase in speed. In particular, string encoding is now 1.5-3 times faster. But decoding is not so good.

Here are the results of measuring the performance of decoding 1000-character strings consisting of characters from different ranges of Unicode, for three versions of Python -- 2.7.3rc2, 3.2.3rc2+ and 3.3.0a1+. Little-endian 32-bit i686 builds, gcc 4.4.

encoding  string                  2.7   3.2   3.3
ascii     " " * 1000              5.4   5.3   1.2
latin1    " " * 1000              1.8   1.7   1.3
latin1    "\u0080" * 1000         1.7   1.6   1.0
utf-8     " " * 1000              6.7   2.4   2.1
utf-8     "\u0080" * 1000        12.2  11.0  13.0
utf-8     "\u0100" * 1000        12.2  11.1  13.6
utf-8     "\u0800" * 1000        14.7  14.4  17.2
utf-8     "\u8000" * 1000        13.9  13.3  17.1
utf-8     "\U00010000" * 1000    17.3  17.5  21.5
utf-16le  " " * 1000              5.5   2.9   6.5
utf-16le  "\u0080" * 1000         5.5   2.9   7.4
utf-16le  "\u0100" * 1000         5.5   2.9   8.9
utf-16le  "\u0800" * 1000         5.5   2.9   8.9
utf-16le  "\u8000" * 1000         5.5   7.5  21.3
utf-16le  "\U00010000" * 1000     9.6  12.9  30.1
utf-16be  " " * 1000              5.5   3.0   9.0
utf-16be  "\u0080" * 1000         5.5   3.1   9.8
utf-16be  "\u0100" * 1000         5.5   3.1  10.4
utf-16be  "\u0800" * 1000         5.5   3.1  10.4
utf-16be  "\u8000" * 1000         5.5   6.6  21.2
utf-16be  "\U00010000" * 1000     9.6  11.2  28.9
utf-32le  " " * 1000             10.2  10.4  15.1
utf-32le  "\u0080" * 1000        10.0  10.4  16.5
utf-32le  "\u0100" * 1000        10.0  10.4  19.8
utf-32le  "\u0800" * 1000        10.0  10.4  19.8
utf-32le  "\u8000" * 1000        10.1  10.4  19.8
utf-32le  "\U00010000" * 1000    11.7  11.3  20.2
utf-32be  " " * 1000             10.0  11.2  15.0
utf-32be  "\u0080" * 1000        10.1  11.2  16.4
utf-32be  "\u0100" * 1000        10.0  11.2  19.7
utf-32be  "\u0800" * 1000        10.1  11.2  19.7
utf-32be  "\u8000" * 1000        10.1  11.2  19.7
utf-32be  "\U00010000" * 1000    11.7  11.2  20.2

The first oddity is that characters from the second half of the Latin1 table are decoded faster than characters from the first half. I think the characters from the first half of the table should be decoded just as quickly.

The second, sadder, oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by a factor of 1.5-2.

The fact that UTF-8 decoding has also slowed in some cases is not surprising. I believe that on platforms with a 64-bit long there may be other oddities.

How serious a problem is this for the Python 3.3 release? I could work on the optimization, if no one is working on this already.
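One plausible way to reproduce numbers of this kind with the stdlib timeit module (a minimal sketch only; the loop count and the set of cases here are arbitrary, and the original harness may have differed):

import codecs
import timeit

def bench(enc, ch, n=1000, loops=10000):
    # Time decoding of an n-character string made of one repeated character
    # and report microseconds per call to the decoder.
    data = (ch * n).encode(enc)
    d = codecs.getdecoder(enc)
    secs = min(timeit.repeat(lambda: d(data), number=loops, repeat=3))
    return secs / loops * 1e6

for enc, ch in [("ascii", " "), ("latin1", "\u0080"), ("utf-8", "\u0800"),
                ("utf-16le", "\u8000"), ("utf-32be", "\U00010000")]:
    print("{:9s} U+{:04X} * 1000: {:5.1f} usec".format(enc, ord(ch), bench(enc, ch)))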
Hi,
On Sun, 25 Mar 2012 19:25:10 +0300
Serhiy Storchaka wrote:
But decoding is not so good.
The general problem with decoding is that you don't know up front what width (1, 2 or 4 bytes) is required for the result. The solution is either to compute the width in a first pass (and decode in a second pass), or decode in a single pass and enlarge the result on the fly when needed. Both incur a slowdown compared to a single-size representation.
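A minimal pure-Python sketch of the two strategies (the real codecs work on bytes in C and build the PEP 393 buffers directly; modelling the input as a list of code points is only for illustration):

def width_for(cp):
    # Width (in bytes per character) of the PEP 393 representation needed
    # for one code point.
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4

def decode_two_pass(codepoints):
    # Strategy 1: a first pass finds the required width, a second pass
    # builds the result with fixed-width writes.
    width = 1
    for cp in codepoints:
        w = width_for(cp)
        if w > width:
            width = w
    return width, list(codepoints)

def decode_widening(codepoints):
    # Strategy 2: a single pass; widen the partial result whenever a wider
    # code point shows up (in C this means reallocating and converting the
    # characters already written).
    width, out = 1, []
    for cp in codepoints:
        w = width_for(cp)
        if w > width:
            width = w
        out.append(cp)
    return width, out

# Both return (2, [0x41, 0x20AC]) for the code points of "A€"; the costs
# differ: a second pass over the input versus possible conversions of the
# partial output.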
The first oddity is that characters from the second half of the Latin1 table are decoded faster than characters from the first half. I think the characters from the first half of the table should be decoded just as quickly.
It's probably a measurement error on your part.
The second, sadder, oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by a factor of 1.5-2.
I don't think UTF-32 is used a lot. As for UTF-16, if you can optimize it, then why not.

Regards,
Antoine
On 25.03.12 20:01, Antoine Pitrou wrote:
The general problem with decoding is that you don't know up front what width (1, 2 or 4 bytes) is required for the result. The solution is either to compute the width in a first pass (and decode in a second pass), or decode in a single pass and enlarge the result on the fly when needed. Both incur a slowdown compared to a single-size representation.
We can significantly reduce the number of checks by using the same trick that is used for fast checking of surrogate characters. While all characters are < U+0100, we know that the result is a 1-byte string (ASCII while all characters are < U+0080). Once we meet a character >= U+0100, then while all characters are < U+10000, we know that the result is a 2-byte string. As soon as we meet the first character >= U+10000, we work with a 4-byte string. There will be several fast loops; the transition to the next loop occurs after a failure in the previous one.
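A rough pure-Python sketch of such a cascade, assuming the input is already a sequence of code points (a real decoder would also write each character into a buffer of the current width inside each loop):

def scan_width(codepoints):
    # Returns 1, 2 or 4: the width needed for the whole string.
    it = iter(codepoints)
    # Fast 1-byte loop: stay here while every character is < U+0100.
    for cp in it:
        if cp >= 0x100:
            break
    else:
        return 1
    # Fast 2-byte loop: entered on the first character >= U+0100.
    if cp < 0x10000:
        for cp in it:
            if cp >= 0x10000:
                break
        else:
            return 2
    # 4-byte case: a character >= U+10000 has been seen.
    return 4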
It's probably a measurement error on your part.
Anyone can test.

$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop

The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing.
On 25 March 2012 19:51, Serhiy Storchaka wrote:
Anyone can test.
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop
The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing.
Hmm, yes. I see the same results. Odd.

PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0020' * 100000).encode(enc)" "d(x)"
10000 loops, best of 3: 37.3 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0080' * 100000).encode(enc)" "d(x)"
100000 loops, best of 3: 18 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0020' * 100000).encode(enc)" "d(x)"
10000 loops, best of 3: 37.6 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0080' * 100000).encode(enc)" "d(x)"
100000 loops, best of 3: 18.3 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0020' * 100000).encode(enc)" "d(x)"
10000 loops, best of 3: 37.8 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0080' * 100000).encode(enc)" "d(x)"
100000 loops, best of 3: 18.3 usec per loop

Paul.
Anyone can test.
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop
The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing.
This is not surprising. When decoding Latin-1, the codec needs to determine whether the string is pure ASCII or not. If it is not, it must be all Latin-1 (it can't be non-Latin-1). For a pure ASCII string, it needs to scan over the entire string, trying to find a non-ASCII character; if there is none, it has to inspect the entire string. In your example, as the first character is already above 127, the search for the maximum character can stop immediately, so it needs to scan the string only once. Try '\u0020' * 999999 + '\u0080', which is a non-ASCII string but should still take the same time as the pure ASCII string.

Regards,
Martin
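A quick way to check this, as a small timing sketch with the stdlib timeit and codecs modules (string sizes and loop counts are arbitrary):

import codecs
import timeit

d = codecs.getdecoder("latin1")
cases = [
    ("pure ASCII", "\u0020" * 1000000),
    ("non-ASCII at the end", "\u0020" * 999999 + "\u0080"),
    ("non-ASCII first", "\u0080" * 1000000),
]
for name, s in cases:
    data = s.encode("latin1")
    secs = min(timeit.repeat(lambda: d(data), number=100, repeat=3))
    # Expectation per the explanation above: the first two cases take about
    # the same time, the third is noticeably faster.
    print("{:22s} {:7.1f} usec per decode".format(name, secs / 100 * 1e6))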
How serious a problem is this for the Python 3.3 release? I could work on the optimization, if no one is working on this already.
I think the people who did the original implementation (Torsten, Victor, and myself) are done with optimizations. So: contributions are welcome. I'm not aware of any release-critical performance degradation (but I'd start with string formatting if I were you).
Cool, Python 3.3 is *much* faster to decode pure ASCII :-)
encoding string 2.7 3.2 3.3
ascii " " * 1000 5.4 5.3 1.2
4.5x faster than Python 2 here.
utf-8 " " * 1000 6.7 2.4 2.1
3.2x faster.

It's cool because in practice, a lot of strings are pure ASCII (as Martin showed in his Django benchmark).
latin1 " " * 1000 1.8 1.7 1.3 latin1 "\u0080" * 1000 1.7 1.6 1.0 ... The first oddity in that the characters from the second half of the Latin1 table decoded faster than the characters from the first half.
The Latin1 decoder of Python 3.3 is *faster* than the decoders of Python 2.7 and 3.2 according to your benchmark. So I don't see any issue here :-) Martin explained why it is slower for pure ASCII.
I think the characters from the first half of the table should be decoded just as quickly.
The Latin1 decoder is already heavily optimized; I don't see how to make it faster.
The second, sadder, oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by a factor of 1.5-2.
Only the ASCII, latin1 and UTF-8 decoders are heavily optimized. We can do better for UTF-16 and UTF-32. I'm just less motivated because UTF-16/32 are less common than ASCII/latin1/UTF-8.
How serious a problem is this for the Python 3.3 release? I could work on the optimization, if no one is working on this already.
I'm interested in any patch optimizing any Python codec. I'm not working on optimizing Python Unicode anymore; various benchmarks showed me that Python 3.3 is as good as or faster than Python 3.2. That's enough for me.

When Python 3.3 is slower than Python 3.2, it's because Python 3.3 must compute the maximum character of the result, and I fail to see how to optimize this requirement. I already introduced many fast paths where it was possible, like creating a substring of an ASCII string (the result is ASCII, no need to scan the substring).

It doesn't mean that it is no longer possible to optimize Python Unicode ;-)

Victor
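A small sketch of the idea behind that substring fast path (in Python, only to illustrate; the real logic lives in the C implementation):

def substring_width(s, start, stop, source_is_ascii):
    # Pick the PEP 393 width ("kind") for a substring's buffer.
    if source_is_ascii:
        # Fast path: every character of an ASCII string is ASCII, so any
        # substring is ASCII too -- no scan of the substring is needed.
        return 1
    # General case: scan the substring for its maximum character.
    sub = s[start:stop]
    maxchar = max(map(ord, sub)) if sub else 0
    if maxchar < 0x100:
        return 1
    if maxchar < 0x10000:
        return 2
    return 4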
On 26.03.12 01:28, Victor Stinner wrote:
Cool, Python 3.3 is *much* faster to decode pure ASCII :-)
It is even faster on large data. 1000 characters is not enough to completely amortize the constant cost of the function calls. Python 3.3 is really cool.
encoding string 2.7 3.2 3.3
ascii " " * 1000 5.4 5.3 1.2
4.5x faster than Python 2 here.
And it can be accelerated (issue #14419).
utf-8 " " * 1000 6.7 2.4 2.1
3.2x faster
In theory, the speed should coincide with the latin1 speed. And it does coincide in the limit, for large data. For medium-sized data the startup overhead is visible and utf-8 is a bit slower than it could be.
It's cool because in practice, a lot of strings are pure ASCII (as Martin showed in his Django benchmark).
But there is a lot of non-ASCII text too. And with mostly-ASCII text containing at least one non-ASCII character (for example, Martin's full name), the UTF-8 decoder copes much worse, and worse than in Python 3.2. The decoder should be slower only by a small amount, related to scanning. I believe there is still headroom for optimization.
I'm interested in any patch optimizing any Python codec. I'm not working on optimizing Python Unicode anymore; various benchmarks showed me that Python 3.3 is as good as or faster than Python 3.2. That's enough for me.
Then would you accept the patch I proposed in issue 14249? This patch will not close the whole gap, but it is very simple and should not cause objections. The optimization I am developing now accelerates the decoder even more, but so far it is too much ugly spaghetti code.
When Python 3.3 is slower than Python 3.2, it's because Python 3.3 must compute the maximum character of the result, and I fail to see how to optimize this requirement.
A significant slowdown is caused by the use of PyUnicode_WRITE with a variable kind inside the loop. In some cases it would be useful to expand the loop into a cascade of independent loops which fall back onto each other (as you have already done in utf8_scanner).
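A rough Python model of the difference (the real cost only shows up in C, where the per-character branch on the kind prevents a tight copy loop; the code points are assumed to fit the chosen kind):

import struct

def write_generic(kind, codepoints):
    # Models a loop around PyUnicode_WRITE with a variable kind: the item
    # size is re-decided for every character, inside the hot loop.
    buf = bytearray()
    for cp in codepoints:
        if kind == 1:
            buf += struct.pack("<B", cp)
        elif kind == 2:
            buf += struct.pack("<H", cp)
        else:
            buf += struct.pack("<I", cp)
    return buf

def write_specialized(kind, codepoints):
    # The cascade idea: hoist the kind check out of the loop, leaving one
    # tight loop per item size (in C each such loop becomes a plain copy).
    buf = bytearray()
    if kind == 1:
        for cp in codepoints:
            buf += struct.pack("<B", cp)
    elif kind == 2:
        for cp in codepoints:
            buf += struct.pack("<H", cp)
    else:
        for cp in codepoints:
            buf += struct.pack("<I", cp)
    return buf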
participants (6)

- "Martin v. Löwis"
- Antoine Pitrou
- martin@v.loewis.de
- Paul Moore
- Serhiy Storchaka
- Victor Stinner