_PyBytesWriter/_PyUnicodeWriter could be faster
Some code needs to maintain an output buffer whose size is unpredictable, such as the bz2/lzma/zlib modules and _PyBytesWriter/_PyUnicodeWriter. In the current code, when the output buffer grows, resizing causes unnecessary memcpy().

issue41486 uses a list of memory blocks to represent the output buffer in the bz2/lzma/zlib modules, which eliminates the overhead of resizing. There are benchmark charts in issue41486:
https://bugs.python.org/issue41486

_PyBytesWriter/_PyUnicodeWriter could use the same approach. If a "general blocks output buffer" were written, it could be used in _PyBytesWriter/bz2/lzma/zlib. (issue41486 is not very general: it uses a bytes object to represent a memory block.)

If a new _PyUnicodeWriter were written this way, it would have a chance to eliminate the overhead of switching PyUnicode_Kind (by recording the switching position):

    'a' * 100_000_000 + '\uABCD'

If anyone has time and is willing to try, it's very welcome. Or I might do this sometime in the future.
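To make the idea concrete, here is a minimal sketch of such a blocks output buffer in plain C. All names here (BlocksBuffer, blocks_write, blocks_finish, BLOCK_SIZE) are hypothetical, and this is simpler than the real issue41486 code, which manages a list of bytes objects:

    /* Grow by appending fixed-size blocks: data already buffered is
       never moved, so growing costs no memcpy().  Zero-initialize a
       BlocksBuffer before use. */
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE (32 * 1024)

    typedef struct Block {
        struct Block *next;
        size_t used;
        char data[BLOCK_SIZE];
    } Block;

    typedef struct {
        Block *head, *tail;
        size_t total;               /* total bytes written so far */
    } BlocksBuffer;

    static int
    blocks_write(BlocksBuffer *buf, const char *data, size_t n)
    {
        while (n > 0) {
            if (buf->tail == NULL || buf->tail->used == BLOCK_SIZE) {
                /* Append a new block; existing blocks are untouched. */
                Block *b = malloc(sizeof(Block));
                if (b == NULL)
                    return -1;
                b->next = NULL;
                b->used = 0;
                if (buf->tail)
                    buf->tail->next = b;
                else
                    buf->head = b;
                buf->tail = b;
            }
            size_t room = BLOCK_SIZE - buf->tail->used;
            size_t k = (n < room) ? n : room;
            memcpy(buf->tail->data + buf->tail->used, data, k);
            buf->tail->used += k;
            buf->total += k;
            data += k;
            n -= k;
        }
        return 0;
    }

    /* Concatenate once at the end: one allocation and one copy of each
       byte, instead of a possible memcpy() at every resize. */
    static char *
    blocks_finish(BlocksBuffer *buf)
    {
        char *out = malloc(buf->total ? buf->total : 1);
        if (out == NULL)
            return NULL;
        char *p = out;
        for (Block *b = buf->head; b != NULL; ) {
            memcpy(p, b->data, b->used);
            p += b->used;
            Block *next = b->next;
            free(b);
            b = next;
        }
        buf->head = buf->tail = NULL;
        return out;
    }

The point is that a write never moves data that is already buffered; the single concatenation pass is deferred to blocks_finish().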
Hi,

On Sun, Oct 25, 2020 at 15:36, Ma Lin <malincns@163.com> wrote:
> Some code needs to maintain an output buffer whose size is unpredictable,
> such as the bz2/lzma/zlib modules and _PyBytesWriter/_PyUnicodeWriter.
> In the current code, when the output buffer grows, resizing causes
> unnecessary memcpy().
> issue41486 uses a list of memory blocks to represent the output buffer in
> the bz2/lzma/zlib modules, which eliminates the overhead of resizing.
Some context.

_PyBytesWriter is an internal C API designed for C functions which return a bytes or bytearray object and use a loop writing into "ptr" (a pointer into a bytes buffer). Such functions expect a single contiguous memory block. It is based on realloc() and overallocation (which can be disabled in the API). It uses a bytes object which is resized on demand. It also uses a short buffer of 512 bytes allocated on the stack for short strings. _PyBytesWriter_Finish() calls _PyBytes_Resize() if needed.

In 2016, I wrote an article on this API:
https://vstinner.github.io/pybyteswriter.html

realloc() does not always imply copying memory. Growing a memory block can sometimes be done in place (no data copy). Same when you shrink a memory block in _PyBytesWriter_Finish(). Also, overallocation reduces the number of realloc() calls. The _PyBytesWriter design is optimized for short strings up to 100 bytes.

--

The _PyUnicodeWriter API is designed for the PEP 393 compact string structure (ASCII, Py_UCS1 latin1, Py_UCS2 and Py_UCS4 formats). It tries to reduce conversions between the 3 formats (Py_UCS1, Py_UCS2 and Py_UCS4) and also uses overallocation to reduce memory copies.

--

By the way, _PyBytesWriter and _PyUnicodeWriter overallocation is different on Windows:

    #ifdef MS_WINDOWS
        /* On Windows, overallocate by 50% is the best factor */
    #   define OVERALLOCATE_FACTOR 2
    #else
        /* On Linux, overallocate by 25% is the best factor */
    #   define OVERALLOCATE_FACTOR 4
    #endif

--

The internal C API _PyAccu is a variant of _PyUnicodeWriter which uses a list of short strings and sometimes concatenates these strings into a single large string.
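To make the _PyBytesWriter pattern described above concrete, a typical function built on it looks roughly like this. It is an internal API, so treat this as a sketch: the header location and struct fields can differ between CPython versions, and duplicate_bytes is just a hypothetical toy example:

    #include "Python.h"  /* _PyBytesWriter is a CPython-internal API */

    static PyObject *
    duplicate_bytes(const char *src, Py_ssize_t len)
    {
        _PyBytesWriter writer;
        char *str;                    /* current write position ("ptr") */

        _PyBytesWriter_Init(&writer);
        writer.overallocate = 1;      /* overallocation can be disabled */

        /* Starts in a 512-byte buffer on the stack; a bytes object is
           only allocated (and then resized on demand) if the output
           outgrows it. */
        str = _PyBytesWriter_Alloc(&writer, len);
        if (str == NULL) {
            _PyBytesWriter_Dealloc(&writer);
            return NULL;
        }

        /* The typical loop: write forward through "str". */
        for (Py_ssize_t i = 0; i < len; i++) {
            *str++ = src[i];
        }

        /* Shrinks the buffer with _PyBytes_Resize() if needed. */
        return _PyBytesWriter_Finish(&writer, str);
    }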
> _PyBytesWriter/_PyUnicodeWriter could use the same approach.
> If a "general blocks output buffer" were written, it could be used in
> _PyBytesWriter/bz2/lzma/zlib. (issue41486 is not very general: it uses a
> bytes object to represent a memory block.)
I understand that the main idea is to not use a single buffer, but a list of buffers, and to concatenate them in _BlocksOutputBuffer_Finish(). Similar idea to the _PyAccu API.

Maybe some functions using _PyBytesWriter can be adapted to use a list of buffers rather than a single buffer. But I'm not convinced that it would make them faster. The question is which kind of functions you want to optimize, for which string lengths, etc.

You should dig into the old issues where I optimized str%args and str.format():

* http://bugs.python.org/issue14687 : str % args
* http://bugs.python.org/issue14744 : str.format()
* https://bugs.python.org/issue2534 : bytes % args

I used benchmarks like:

https://github.com/vstinner/pymicrobench/blob/master/bench_bytes_format_int....
https://github.com/vstinner/pymicrobench/blob/master/bench_str_format.py
https://github.com/vstinner/pymicrobench/blob/master/bench_str_format_keywor...
> If a new _PyUnicodeWriter were written this way, it would have a chance to
> eliminate the overhead of switching PyUnicode_Kind (by recording the
> switching position):
>
>     'a' * 100_000_000 + '\uABCD'
For a+b, Python first computes "a", then "b", and finally "a+b". I don't see how your API could optimize such code.

For operations on strings like "%s%s" % (a, b) or "{}{}".format(a, b), Python internally uses _PyUnicodeWriter. To format "a", _PyUnicodeWriter just stores a reference to it as _PyUnicodeWriter.buffer and marks the buffer as read-only (an optimization for when the result is made of a single string: no copy is made at all!). To format "b", _PyUnicodeWriter_WriteStr() converts the buffer to Py_UCS2 and then writes the new string. The "a" string is only written once, not twice. I don't see how your API would avoid copies in such cases.

Moreover, str % args and str.format() are optimized to avoid over-allocation when "b" is written: the final _PyUnicodeWriter_Finish() call is free, it does nothing.
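Sketched in terms of the internal writer API (details vary by CPython version, and concat_two is a hypothetical name with simplified error handling), the "%s%s" % (a, b) case behaves roughly like this:

    #include "Python.h"  /* _PyUnicodeWriter is a CPython-internal API */

    static PyObject *
    concat_two(PyObject *a, PyObject *b)
    {
        _PyUnicodeWriter writer;
        _PyUnicodeWriter_Init(&writer);

        /* First write into an empty writer: only a reference to "a" is
           stored and the buffer is marked read-only.  If nothing else
           were written, Finish() would return "a" without any copy. */
        if (_PyUnicodeWriter_WriteStr(&writer, a) < 0)
            goto error;

        /* Second write: the read-only buffer is converted to the widest
           kind needed (e.g. Py_UCS1 -> Py_UCS2), which copies "a" once,
           then "b" is appended. */
        if (_PyUnicodeWriter_WriteStr(&writer, b) < 0)
            goto error;

        return _PyUnicodeWriter_Finish(&writer);

    error:
        _PyUnicodeWriter_Dealloc(&writer);
        return NULL;
    }

str % args and str.format() drive the same calls with extra steps, and they avoid over-allocating on the final write, which is why the _PyUnicodeWriter_Finish() call costs nothing there.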
> If anyone has time and is willing to try, it's very welcome. Or I might do
> this sometime in the future.
I can be completely wrong, please try and show benchmarks proving that your approach is faster on specific use cases, without hurting performance on short strings ;-)

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Thanks for your very informative reply. I replied to you in issue41486. Maybe memory blocks will not bring a performance improvement to _PyBytesWriter/_PyUnicodeWriter, which is a bit frustrating.
> For a+b, Python first computes "a", then "b", and finally "a+b". I don't
> see how your API could optimize such code.
I mean this situation:

    s = 'a' * 100_000_000 + '\uABCD'
    b = s.encode('utf-8')
    b.decode('utf-8')    # <- this situation

I realize I was wrong: the UCS1->UCS2 transformation will only be done once, so it only saves a memcpy(). Even in this case it will only save two memcpy() calls:

    s = 'a' * 100_000_000 + '\uABCD' * 100_000_000 + '\U00012345'
    b = s.encode('utf-8')
    b.decode('utf-8')
participants (2)
- Ma Lin
- Victor Stinner