Optimize Unicode strings in Python 3.3

Hi,

Different people are working on improving the performance of Unicode strings in Python 3.3. This Python version is very different from Python 3.2 because of PEP 393, and it is still unclear to me what the best way to create a new Unicode string is.

There are different approaches:

* Use the legacy (Py_UNICODE) API; PyUnicode_READY() converts the result to the canonical form. The CJK codecs still use this API.

* Use a Py_UCS4 buffer and then convert to the canonical form (ASCII, UCS1 or UCS2). Approach taken by io.StringIO. io.StringIO is not only used to write but also to read, so a Py_UCS4 buffer is a good compromise.

* PyAccu API: an optimized version of chunks=[]; for ...: ... chunks.append(text); return ''.join(chunks).

* Two steps: compute the length and the maximum character of the output string, allocate the output string, and then write the characters. str%args was using this.

* Optimistic approach: start with an ASCII buffer, then enlarge and widen the buffer (to UCS2 and then UCS4) as new characters are written. Approach used by the UTF-8 decoder and, as of today, by str%args. (A sketch of this strategy follows this message.)

The optimistic approach uses realloc() to resize the string. It is faster than the PyAccu approach (at least for short ASCII strings), maybe because it avoids creating temporary short strings. realloc() looks efficient on Linux and on Windows (at least Windows 7).

Various notes:

* PyUnicode_READ() is slower than reading a Py_UNICODE array.

* Some decoders unroll the main loop to process 4 or 8 bytes (32 or 64 bits CPU) at each step.

I am interested to know if you have other tricks to optimize Unicode strings in Python, or if you would like to work on this topic. There are open issues related to optimizing Unicode:

#11313: Speed up default encode()/decode()
#12807: Optimization/refactoring for {bytearray, bytes, unicode}.strip()
#14419: Faster ascii decoding
#14624: Faster utf-16 decoder
#14625: Faster utf-32 decoder
#14654: More fast utf-8 decoding
#14716: Use unicode_writer API for str.format()

Victor
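A minimal, hypothetical C sketch of the optimistic strategy described above (the names writer_t, writer_init, writer_prepare and writer_write_char are invented for illustration; this is not the actual CPython code): the buffer starts at 1 byte per character, grows in place with realloc() while the width stays the same, and is copy-widened only when a wider character has to be written.

/* Hypothetical sketch of the "optimistic" buffer strategy; illustration only,
   not the actual CPython implementation. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void   *data;       /* UCS1, UCS2 or UCS4 elements */
    size_t  size;       /* characters written so far */
    size_t  allocated;  /* capacity, in characters */
    int     kind;       /* bytes per character: 1, 2 or 4 */
} writer_t;

static void writer_init(writer_t *w)
{
    w->data = NULL;
    w->size = 0;
    w->allocated = 0;
    w->kind = 1;        /* optimistic: assume latin-1 until proven otherwise */
}

/* Make room for `extra` more characters, widening if `maxchar` requires it. */
static int writer_prepare(writer_t *w, size_t extra, uint32_t maxchar)
{
    int new_kind = maxchar < 0x100 ? 1 : (maxchar < 0x10000 ? 2 : 4);
    size_t needed = w->size + extra;

    if (new_kind <= w->kind) {
        if (needed <= w->allocated)
            return 0;
        /* Same width: grow in place, with a bit of overallocation. */
        size_t new_alloc = needed + needed / 4;
        void *buf = realloc(w->data, new_alloc * (size_t)w->kind);
        if (buf == NULL)
            return -1;
        w->data = buf;
        w->allocated = new_alloc;
        return 0;
    }

    /* A wider character appeared: allocate a wider buffer and copy-widen. */
    size_t new_alloc = needed > w->allocated ? needed : w->allocated;
    void *buf = malloc(new_alloc * (size_t)new_kind);
    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < w->size; i++) {
        uint32_t ch = (w->kind == 1) ? ((uint8_t *)w->data)[i]
                    : (w->kind == 2) ? ((uint16_t *)w->data)[i]
                                     : ((uint32_t *)w->data)[i];
        if (new_kind == 2)
            ((uint16_t *)buf)[i] = (uint16_t)ch;
        else
            ((uint32_t *)buf)[i] = ch;
    }
    free(w->data);
    w->data = buf;
    w->allocated = new_alloc;
    w->kind = new_kind;
    return 0;
}

static int writer_write_char(writer_t *w, uint32_t ch)
{
    if (writer_prepare(w, 1, ch) < 0)
        return -1;
    if (w->kind == 1)
        ((uint8_t *)w->data)[w->size] = (uint8_t)ch;
    else if (w->kind == 2)
        ((uint16_t *)w->data)[w->size] = (uint16_t)ch;
    else
        ((uint32_t *)w->data)[w->size] = ch;
    w->size++;
    return 0;
}

The point of the realloc() path is that a long run of latin-1 output never pays for a widening copy; only the first wider character does.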

Beyond creation, the most frequent approach is to specialize loops for all three possible widths, allowing the compiler to hard-code the element size. This brings the performance back to the speed of accessing a Py_UNICODE array (or faster, for 1-byte strings); a sketch of this pattern follows below. A possible micro-optimization might be to use pointer arithmetic instead of indexing. However, I would expect that compilers already convert a counting loop into pointer arithmetic if the index is only ever used for array access.

A source of slow-down appears to be the widening copy operations. I wonder whether microprocessors are able to do this faster than what the compiler generates for a naive copying loop.

Another potential area for further optimization is to better pass through PyObject*. Some APIs still use char* or Py_UNICODE* when the caller actually holds a PyObject*, and the callee ultimately recreates an object out of the pointers being passed.

Some people (hi Larry) still think that using a rope representation for string concatenation might improve things; see #1569040.

Regards,
Martin
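An illustrative example of the width-specialized loop pattern (not code from CPython; the helper name count_char is made up, and the string is assumed to already be in its canonical, ready PEP 393 form):

#include <Python.h>

/* Count occurrences of `ch` in `str`, dispatching once on the string's kind
   so the compiler can hard-code the element size inside each loop. */
static Py_ssize_t
count_char(PyObject *str, Py_UCS4 ch)
{
    Py_ssize_t i, len = PyUnicode_GET_LENGTH(str);
    Py_ssize_t n = 0;

    switch (PyUnicode_KIND(str)) {
    case PyUnicode_1BYTE_KIND: {
        const Py_UCS1 *p = PyUnicode_1BYTE_DATA(str);
        for (i = 0; i < len; i++)
            if (p[i] == ch)
                n++;
        break;
    }
    case PyUnicode_2BYTE_KIND: {
        const Py_UCS2 *p = PyUnicode_2BYTE_DATA(str);
        for (i = 0; i < len; i++)
            if (p[i] == ch)
                n++;
        break;
    }
    default: {  /* PyUnicode_4BYTE_KIND */
        const Py_UCS4 *p = PyUnicode_4BYTE_DATA(str);
        for (i = 0; i < len; i++)
            if (p[i] == ch)
                n++;
        break;
    }
    }
    return n;
}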

04.05.12 02:45, Victor Stinner wrote:
Actually, today the UTF-8 decoder uses the two-step approach; only after encountering an error does it switch to the optimistic approach.
IMHO, realloc() has nothing to do with this. The point is the cost of managing the list and creating temporary strings.
Various notes: * PyUnicode_READ() is slower than reading a Py_UNICODE array.
And PyUnicode_WRITE() is slower than writing a Py_UNICODE/Py_UCS* array.
* Some decoders unroll the main loop to process 4 or 8 bytes (32 or 64 bits CPU) at each step.
Note that this is not only CPU- but also OS-dependent (LP64 vs LLP64).
I am interested to know if you have other tricks to optimize Unicode strings in Python, or if you would like to work on this topic.
The optimized ASCII decoder (issue #14419) not only reads 4 or 8 bytes at a time, but also writes them all at once (see the sketch after this message). This is a very specific optimization. A more general principle is replacing serial scanning and translating with one-pass optimistic reading and writing, which improves the efficiency of the memory cache. I'm going to try it in the UTF-8 decoder; it should allow the speed of decoding ASCII-only strings to reach the speed of the optimized ASCII decoder.
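A hedged sketch of the word-at-a-time ASCII fast path under discussion (illustrative only; the real patch is attached to issue #14419). Using a fixed-width uint64_t also sidesteps the LP64 vs LLP64 difference noted above, since long is 8 bytes on 64-bit Linux but only 4 bytes on 64-bit Windows:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy input bytes into the UCS1 output as long as they are pure ASCII and
   return the number of bytes handled; the caller falls back to the general
   (slow) UTF-8 path for the remainder. */
static size_t
ascii_fast_path(const char *in, size_t len, uint8_t *out)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* 8 input bytes are checked and written per iteration. */
    while (i + sizeof(uint64_t) <= len) {
        uint64_t chunk;
        memcpy(&chunk, in + i, sizeof(chunk));     /* safe unaligned read */
        if (chunk & high_bits)
            break;                                 /* non-ASCII byte found */
        memcpy(out + i, &chunk, sizeof(chunk));    /* ASCII == UCS1: copy as-is */
        i += sizeof(chunk);
    }
    /* Byte-at-a-time tail. */
    while (i < len && (unsigned char)in[i] < 0x80) {
        out[i] = (uint8_t)in[i];
        i++;
    }
    return i;
}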

Hi,
I ran extensive benchmarks on these 4 methods for str%args and str.format(args).

The "two steps" method is not promising: parsing the format string twice is slower than other methods.

The PyAccu API is faster than a Py_UCS4 buffer for concatenating a lot of strings, but it is slower in many other cases.

I implemented the last method as the new internal "_PyUnicodeWriter" API: resize / widen the string buffer when writing new characters. I implemented further optimizations:

* overallocate the buffer to limit the cost of realloc() (a rough sketch of the idea follows this message)
* write characters directly into the buffer, avoiding temporary buffers when possible (it is possible in most cases)
* disable overallocation when formatting the last argument
* don't copy by value but copy by reference if the result is just a string (an optimization already implemented indirectly in the PyAccu API)

The _PyUnicodeWriter is the fastest method: it gives a speed-up of 30% over Py_UCS4 / PyAccu in general, and from 60% to 100% in some specific cases!

I also compared str%args and str.format() with Python 2.7 (byte strings), 3.2 (UTF-16 or UCS-4) and 3.3 (PEP 393): Python 3.3 is as fast as Python 2.7 and sometimes faster! (Whereas Python 3.2 is 10 to 30% slower than Python 2 in general.)

--

I wrote a tool to run benchmarks and to compare results:
https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py
https://bitbucket.org/haypo/misc/src/tip/python/bench_str.py

Run the benchmark: ./python benchmark.py --file=FILE script bench_str.py
Compare results: ./python benchmark.py compare_to FILE1 FILE2 FILE3 ...

--

Python 2.7 vs 3.2 vs 3.3:
http://bugs.python.org/file25685/REPORT_32BIT_2.7_3.2_writer
http://bugs.python.org/file25687/REPORT_64BIT_2.7_3.2_writer
http://bugs.python.org/file25757/report_windows7

Warning: for the Windows benchmark, Python 3.3 is compiled in 32 bits, whereas 2.7 and 3.2 are compiled in 64 bits (formatting integers is slower in 32 bits).

--

UCS4 vs PyAccu vs _PyUnicodeWriter:
http://bugs.python.org/file25686/REPORT_32BIT_3.3
http://bugs.python.org/file25688/REPORT_64BIT_3.3

Victor
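For illustration, the overallocation idea can be reduced to a small helper like the one below; the 25% headroom is an assumption made for this sketch, not necessarily the factor used by _PyUnicodeWriter:

#include <stddef.h>

/* Growing the buffer beyond what is strictly needed turns a sequence of n
   short writes into O(log n) reallocations instead of O(n). */
static size_t
new_allocation(size_t needed, int overallocate)
{
    if (!overallocate)              /* e.g. while writing the last argument */
        return needed;
    return needed + needed / 4;     /* keep ~25% headroom for future writes */
}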

On 30.05.12 01:44, Victor Stinner wrote:
The "two steps" method is not promising: parsing the format string twice is slower than other methods.
The "1.5 steps" method is more promising -- first parse the format string in an efficient internal representation, and then allocate the output string and then write characters (or enlarge and widen the buffer, but with more information in any case). The internal representation can be cached (as for struct module) that for a repeated formatting will reduce the cost of parsing to zero.

I implemented something like that, and it was not efficient and very complex. See for example the (incomplete) patch for str%args attached to issue #14687: http://bugs.python.org/file25413/pyunicode_format-2.patch

IMO this approach is less efficient than the "Unicode writer" approach because:

- you have to create many substrings or temporary strings in the first step, or (worse) compute each argument twice: the writer approach is more efficient here because it avoids computing substrings and temporary strings

- you have to parse the format string twice, or you have to write two versions of the code: first create a list of items, then concatenate the items. The PyAccu method concatenates substrings at the end; it is less efficient than the writer method (e.g. it has to create a string of N fill characters to pad to WIDTH characters).

- the code is more complex than the writer method (which is very similar to what is used in Python 2.7 and 3.2)

I wrote a much more complex patch for str%args that remembers variables from the first step to avoid most of the parsing work in the second step. The patch was very complex and hard to maintain. I chose not to publish it and to try another approach (the Unicode writer).

Note: I'm talking about str%args and str.format(args); the Unicode writer is not the most efficient method for *every* function creating strings!

Victor

On 30.05.12 14:26, Victor Stinner wrote:
I have seen and commented on this patch. That's not what I'm talking about.
IMO this approach is less efficient than the "Unicode writer" approach because:
I brought up this approach not in opposition to the "Unicode writer", but for comparison with the straight "two steps" method. Of course, it can be combined with the "Unicode writer" to get the benefits of both methods. For example, you can widen the output buffer in advance to the width of the format string, or disable overallocation when formatting the last argument with a non-empty suffix.
Not in the first step but in the second step (which is the only step if you use caching), if you use the "Unicode writer".
The code is divided into a compiler and an interpreter; only the former parses the format string. See Modules/_struct.c.
- the code is more complex than the writer method (which is very similar to what is used in Python 2.7 and 3.2)
The code that uses the writer method has turned out to be rather complicated, so the difference in the total complexity of these approaches has become smaller. ;-) But it is really not easy work, and success is not assured, so let it wait for its time.

participants (5):
- Glenn Linderman
- martin@v.loewis.de
- Nick Coghlan
- Serhiy Storchaka
- Victor Stinner