PEP 393 decode() oddity
PEP 393 (Flexible String Representation) is, without doubt, one of the pearls of Python 3.3. In addition to reducing memory consumption, it also often leads to a corresponding increase in speed. In particular, string encoding is now 1.5-3 times faster. But decoding is not so good.

Here are the results of measuring the performance of decoding 1000-character strings consisting of characters from different ranges of Unicode, for three versions of Python -- 2.7.3rc2, 3.2.3rc2+ and 3.3.0a1+. Little-endian 32-bit i686 builds, gcc 4.4.

encoding  string                  2.7   3.2   3.3
ascii     " " * 1000              5.4   5.3   1.2
latin1    " " * 1000              1.8   1.7   1.3
latin1    "\u0080" * 1000         1.7   1.6   1.0
utf-8     " " * 1000              6.7   2.4   2.1
utf-8     "\u0080" * 1000        12.2  11.0  13.0
utf-8     "\u0100" * 1000        12.2  11.1  13.6
utf-8     "\u0800" * 1000        14.7  14.4  17.2
utf-8     "\u8000" * 1000        13.9  13.3  17.1
utf-8     "\U00010000" * 1000    17.3  17.5  21.5
utf-16le  " " * 1000              5.5   2.9   6.5
utf-16le  "\u0080" * 1000         5.5   2.9   7.4
utf-16le  "\u0100" * 1000         5.5   2.9   8.9
utf-16le  "\u0800" * 1000         5.5   2.9   8.9
utf-16le  "\u8000" * 1000         5.5   7.5  21.3
utf-16le  "\U00010000" * 1000     9.6  12.9  30.1
utf-16be  " " * 1000              5.5   3.0   9.0
utf-16be  "\u0080" * 1000         5.5   3.1   9.8
utf-16be  "\u0100" * 1000         5.5   3.1  10.4
utf-16be  "\u0800" * 1000         5.5   3.1  10.4
utf-16be  "\u8000" * 1000         5.5   6.6  21.2
utf-16be  "\U00010000" * 1000     9.6  11.2  28.9
utf-32le  " " * 1000             10.2  10.4  15.1
utf-32le  "\u0080" * 1000        10.0  10.4  16.5
utf-32le  "\u0100" * 1000        10.0  10.4  19.8
utf-32le  "\u0800" * 1000        10.0  10.4  19.8
utf-32le  "\u8000" * 1000        10.1  10.4  19.8
utf-32le  "\U00010000" * 1000    11.7  11.3  20.2
utf-32be  " " * 1000             10.0  11.2  15.0
utf-32be  "\u0080" * 1000        10.1  11.2  16.4
utf-32be  "\u0100" * 1000        10.0  11.2  19.7
utf-32be  "\u0800" * 1000        10.1  11.2  19.7
utf-32be  "\u8000" * 1000        10.1  11.2  19.7
utf-32be  "\U00010000" * 1000    11.7  11.2  20.2

The first oddity is that characters from the second half of the Latin1 table are decoded faster than characters from the first half. I think the characters from the first half of the table should be decoded just as quickly.

The second, sadder, oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by a factor of 1.5-2.

The fact that UTF-8 decoding has also slowed in some cases is not surprising. I believe that on platforms with a 64-bit long there may be other oddities.

How serious a problem is this for the Python 3.3 release? I could work on the optimization, if no one is working on this already.
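One plausible way to reproduce numbers of this kind with the stdlib timeit module (a minimal sketch only; the loop count and the set of cases here are arbitrary, and the original harness may have differed):

import codecs
import timeit

def bench(enc, ch, n=1000, loops=10000):
    # Time decoding of an n-character string made of one repeated character
    # and report microseconds per call to the decoder.
    data = (ch * n).encode(enc)
    d = codecs.getdecoder(enc)
    secs = min(timeit.repeat(lambda: d(data), number=loops, repeat=3))
    return secs / loops * 1e6

for enc, ch in [("ascii", " "), ("latin1", "\u0080"), ("utf-8", "\u0800"),
                ("utf-16le", "\u8000"), ("utf-32be", "\U00010000")]:
    print("{:9s} U+{:04X} * 1000: {:5.1f} usec".format(enc, ord(ch), bench(enc, ch)))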
Hi,
On Sun, 25 Mar 2012 19:25:10 +0300
Serhiy Storchaka wrote:
But decoding is not so good.
The general problem with decoding is that you don't know up front what width (1, 2 or 4 bytes) is required for the result. The solution is either to compute the width in a first pass (and decode in a second pass), or decode in a single pass and enlarge the result on the fly when needed. Both incur a slowdown compared to a single-size representation.
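A minimal pure-Python sketch of the two strategies (the real codecs work on bytes in C and build the PEP 393 buffers directly; modelling the input as a list of code points is only for illustration):

def width_for(cp):
    # Width (in bytes per character) of the PEP 393 representation needed
    # for one code point.
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4

def decode_two_pass(codepoints):
    # Strategy 1: a first pass finds the required width, a second pass
    # builds the result with fixed-width writes.
    width = 1
    for cp in codepoints:
        w = width_for(cp)
        if w > width:
            width = w
    return width, list(codepoints)

def decode_widening(codepoints):
    # Strategy 2: a single pass; widen the partial result whenever a wider
    # code point shows up (in C this means reallocating and converting the
    # characters already written).
    width, out = 1, []
    for cp in codepoints:
        w = width_for(cp)
        if w > width:
            width = w
        out.append(cp)
    return width, out

# Both return (2, [0x41, 0x20AC]) for the code points of "A€"; the costs
# differ: a second pass over the input versus possible conversions of the
# partial output.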
The first oddity is that characters from the second half of the Latin1 table are decoded faster than characters from the first half. I think the characters from the first half of the table should be decoded just as quickly.
It's probably a measurement error on your part.
The second, sadder, oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by a factor of 1.5-2.
I don't think UTF-32 is used a lot. As for UTF-16, if you can optimize it, then why not.

Regards,
Antoine
On 25.03.12 20:01, Antoine Pitrou wrote:
The general problem with decoding is that you don't know up front what width (1, 2 or 4 bytes) is required for the result. The solution is either to compute the width in a first pass (and decode in a second pass), or decode in a single pass and enlarge the result on the fly when needed. Both incur a slowdown compared to a single-size representation.
We can significantly reduce the number of checks by using the same trick that is used for fast checking of surrogate characters. While all characters are < U+0100, we know that the result is a 1-byte string (ASCII while all characters are < U+0080). Once we meet a character >= U+0100, then while all characters are < U+10000, we know that the result is a 2-byte string. As soon as we meet the first character >= U+10000, we work with a 4-byte string. There will be several fast loops; the transition to the next loop occurs after a failure in the previous one.
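A rough pure-Python sketch of such a cascade, assuming the input is already a sequence of code points (a real decoder would also write each character into a buffer of the current width inside each loop):

def scan_width(codepoints):
    # Returns 1, 2 or 4: the width needed for the whole string.
    it = iter(codepoints)
    # Fast 1-byte loop: stay here while every character is < U+0100.
    for cp in it:
        if cp >= 0x100:
            break
    else:
        return 1
    # Fast 2-byte loop: entered on the first character >= U+0100.
    if cp < 0x10000:
        for cp in it:
            if cp >= 0x10000:
                break
        else:
            return 2
    # 4-byte case: a character >= U+10000 has been seen.
    return 4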
It's probably a measurement error on your part.
Anyone can test.

$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop

The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing.
On 25 March 2012 19:51, Serhiy Storchaka wrote:
Anyone can test.
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop
The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing.
Hmm, yes. I see the same results. Odd.

PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0020' * 100000).encode(enc)" "d(x)"
10000 loops, best of 3: 37.3 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0080' * 100000).encode(enc)" "d(x)"
100000 loops, best of 3: 18 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0020' * 100000).encode(enc)" "d(x)"
10000 loops, best of 3: 37.6 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0080' * 100000).encode(enc)" "d(x)"
100000 loops, best of 3: 18.3 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0020' * 100000).encode(enc)" "d(x)"
10000 loops, best of 3: 37.8 usec per loop
PS D:\Data> py -3.3 -m timeit -s "enc = 'latin1'; import codecs; d = codecs.getdecoder(enc); x = ('\u0080' * 100000).encode(enc)" "d(x)"
100000 loops, best of 3: 18.3 usec per loop

Paul.
Anyone can test.
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop
The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing.
This is not surprising. When decoding Latin-1, the codec needs to determine whether the string is pure ASCII or not. If it is not, it must be all Latin-1 (it can't be non-Latin-1). For a pure ASCII string, it needs to scan over the entire string, trying to find a non-ASCII character; if there is none, it has to inspect the entire string. In your example, as the first character is already above 127, the search for the maximum character can stop immediately, so it needs to scan the string only once. Try '\u0020' * 999999 + '\u0080', which is a non-ASCII string but should still take the same time as the pure ASCII string.

Regards,
Martin
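A quick way to check this, as a small timing sketch with the stdlib timeit and codecs modules (string sizes and loop counts are arbitrary):

import codecs
import timeit

d = codecs.getdecoder("latin1")
cases = [
    ("pure ASCII", "\u0020" * 1000000),
    ("non-ASCII at the end", "\u0020" * 999999 + "\u0080"),
    ("non-ASCII first", "\u0080" * 1000000),
]
for name, s in cases:
    data = s.encode("latin1")
    secs = min(timeit.repeat(lambda: d(data), number=100, repeat=3))
    # Expectation per the explanation above: the first two cases take about
    # the same time, the third is noticeably faster.
    print("{:22s} {:7.1f} usec per decode".format(name, secs / 100 * 1e6))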
How serious a problem is this for the Python 3.3 release? I could work on the optimization, if no one is working on this already.
I think the people who did the original implementation (Torsten, Victor, and myself) are done with optimizations. So: contributions are welcome. I'm not aware of any release-critical performance degradation (but I'd start with string formatting if I were you).
Cool, Python 3.3 is *much* faster to decode pure ASCII :-)
encoding string 2.7 3.2 3.3
ascii " " * 1000 5.4 5.3 1.2
4.5x faster than Python 2 here.
utf-8 " " * 1000 6.7 2.4 2.1
3.2x faster.

It's cool because in practice, a lot of strings are pure ASCII (as Martin showed in his Django benchmark).
latin1 " " * 1000 1.8 1.7 1.3 latin1 "\u0080" * 1000 1.7 1.6 1.0 ... The first oddity in that the characters from the second half of the Latin1 table decoded faster than the characters from the first half.
The Latin1 decoder of Python 3.3 is *faster* than the decoders of Python 2.7 and 3.2 according to your benchmark. So I don't see any issue here :-) Martin explained why it is slower for pure ASCII.
I think the characters from the first half of the table should be decoded just as quickly.
The Latin1 decoder is already heavily optimized; I don't see how to make it faster.
The second, sadder, oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by a factor of 1.5-2.
Only the ASCII, latin1 and UTF-8 decoders are heavily optimized. We can do better for UTF-16 and UTF-32. I'm just less motivated because UTF-16/32 are less common than ASCII/latin1/UTF-8.
How serious a problem is this for the Python 3.3 release? I could work on the optimization, if no one is working on this already.
I'm interested in any patch optimizing any Python codec. I'm not working on optimizing Python Unicode anymore; various benchmarks showed me that Python 3.3 is as good as or faster than Python 3.2. That's enough for me.

When Python 3.3 is slower than Python 3.2, it's because Python 3.3 must compute the maximum character of the result, and I fail to see how to optimize this requirement. I already introduced many fast paths where it was possible, like creating a substring of an ASCII string (the result is ASCII, no need to scan the substring).

It doesn't mean that it is no longer possible to optimize Python Unicode ;-)

Victor
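A small sketch of the idea behind that substring fast path (in Python, only to illustrate; the real logic lives in the C implementation):

def substring_width(s, start, stop, source_is_ascii):
    # Pick the PEP 393 width ("kind") for a substring's buffer.
    if source_is_ascii:
        # Fast path: every character of an ASCII string is ASCII, so any
        # substring is ASCII too -- no scan of the substring is needed.
        return 1
    # General case: scan the substring for its maximum character.
    sub = s[start:stop]
    maxchar = max(map(ord, sub)) if sub else 0
    if maxchar < 0x100:
        return 1
    if maxchar < 0x10000:
        return 2
    return 4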
On 26.03.12 01:28, Victor Stinner wrote:
Cool, Python 3.3 is *much* faster to decode pure ASCII :-)
It is even faster on large data. 1000 characters is not enough to completely amortize the constant cost of the function calls. Python 3.3 is really cool.
encoding string 2.7 3.2 3.3
ascii " " * 1000 5.4 5.3 1.2
4.5x faster than Python 2 here.
And it can be accelerated (issue #14419).
utf-8 " " * 1000 6.7 2.4 2.1
3.2x faster
In theory, the speed should coincide with the latin1 speed. And it does coincide in the limit, for large data. For medium-sized data the startup overhead is visible and utf-8 is a bit slower than it could be.
It's cool because in practice, a lot of strings are pure ASCII (as Martin showed in his Django benchmark).
But there is a lot of non-ASCII text too. And with mostly-ASCII text containing at least one non-ASCII character (for example, Martin's full name), the UTF-8 decoder copes much worse, and worse than in Python 3.2. The decoder should be slower only by a small amount, related to scanning. I believe there is still headroom for optimization.
I'm interested in any patch optimizing any Python codec. I'm not working on optimizing Python Unicode anymore; various benchmarks showed me that Python 3.3 is as good as or faster than Python 3.2. That's enough for me.
Then would you accept the patch I proposed in issue 14249? This patch will not close the whole gap, but it is very simple and should not cause objections. The optimization I am developing now accelerates the decoder even more, but so far it is too much ugly spaghetti code.
When Python 3.3 is slower than Python 3.2, it's because Python 3.3 must compute the maximum character of the result, and I fail to see how to optimize this requirement.
A significant slowdown is caused by the use of PyUnicode_WRITE with a variable kind inside the loop. In some cases it would be useful to expand the loop into a cascade of independent loops which fall back onto each other (as you have already done in utf8_scanner).
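A rough Python model of the difference (the real cost only shows up in C, where the per-character branch on the kind prevents a tight copy loop; the code points are assumed to fit the chosen kind):

import struct

def write_generic(kind, codepoints):
    # Models a loop around PyUnicode_WRITE with a variable kind: the item
    # size is re-decided for every character, inside the hot loop.
    buf = bytearray()
    for cp in codepoints:
        if kind == 1:
            buf += struct.pack("<B", cp)
        elif kind == 2:
            buf += struct.pack("<H", cp)
        else:
            buf += struct.pack("<I", cp)
    return buf

def write_specialized(kind, codepoints):
    # The cascade idea: hoist the kind check out of the loop, leaving one
    # tight loop per item size (in C each such loop becomes a plain copy).
    buf = bytearray()
    if kind == 1:
        for cp in codepoints:
            buf += struct.pack("<B", cp)
    elif kind == 2:
        for cp in codepoints:
            buf += struct.pack("<H", cp)
    else:
        for cp in codepoints:
            buf += struct.pack("<I", cp)
    return buf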
participants (6)

- "Martin v. Löwis"
- Antoine Pitrou
- martin@v.loewis.de
- Paul Moore
- Serhiy Storchaka
- Victor Stinner