io.BytesIO slower than monkey-patching io.RawIOBase
While working on #1767933, Serhiy observed that "monkey-patching" one of the io base classes is faster than using BytesIO when you need a file-like object to write into. I've distilled it into this standalone test:

    import io

    data = [b'a' * 10, b'bb' * 5, b'ccc' * 5] * 10000

    def withbytesio():
        bio = io.BytesIO()
        for i in data:
            bio.write(i)
        return bio.getvalue()

    def monkeypatching():
        mydata = []
        file = io.RawIOBase()
        file.writable = lambda: True
        file.write = mydata.append
        for i in data:
            file.write(i)
        return b''.join(mydata)

The second approach is consistently 10-20% faster than the first one (depending on input) for trunk Python 3.3.

Is there any reason for this to be so? What does BytesIO give us that the second approach does not? (I tried adding more methods to the patched RawIOBase to make it more functional, like seekable() and tell(), and it doesn't affect performance.)

This also raises a "moral" question: should I be using the second approach deep inside the stdlib (ET.tostring) just because it's faster?

Eli
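The comparison in the message can be reproduced with a small timeit harness (a sketch, not part of the original message; the f-string formatting requires Python 3.6+, while the thread itself targets 3.3):

```python
import io
import timeit

data = [b'a' * 10, b'bb' * 5, b'ccc' * 5] * 10000

def withbytesio():
    # Write every chunk into a BytesIO, then materialize the result.
    bio = io.BytesIO()
    for chunk in data:
        bio.write(chunk)
    return bio.getvalue()

def monkeypatching():
    # Replace write() on a bare RawIOBase instance with list.append,
    # then join the accumulated chunks once at the end.
    mydata = []
    f = io.RawIOBase()
    f.writable = lambda: True
    f.write = mydata.append
    for chunk in data:
        f.write(chunk)
    return b''.join(mydata)

# Both approaches must produce identical output.
assert withbytesio() == monkeypatching()

if __name__ == '__main__':
    for fn in (withbytesio, monkeypatching):
        elapsed = timeit.timeit(fn, number=20)
        print(f'{fn.__name__}: {elapsed:.3f}s')
```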
> The second approach is consistently 10-20% faster than the first one (depending on input) for trunk Python 3.3.
I think the difference is that BytesIO spends extra time reallocating memory as the buffer grows during the write loop, whereas b''.join computes the required allocation up front since it already knows the final length.
On Tue, Jul 17, 2012 at 2:57 PM, John O'Connor wrote:
>> The second approach is consistently 10-20% faster than the first one (depending on input) for trunk Python 3.3.
>
> I think the difference is that BytesIO spends extra time reallocating memory as the buffer grows during the write loop, whereas b''.join computes the required allocation up front since it already knows the final length.
BytesIO is actually missing an optimisation that is already used in StringIO: the StringIO C implementation uses a fragment accumulator internally, and collapses that into a single string object when getvalue() is called. BytesIO is still using the old "resize-the-buffer-as-you-go" strategy, and thus ends up repeatedly reallocating the buffer as the data sequence grows incrementally. It should be optimised to work the same way StringIO does (which is effectively the same way the monkeypatched version works).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
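The fragment-accumulator strategy Nick describes can be sketched in pure Python (an illustration of the idea only; the class name is invented here and this is not the actual C implementation):

```python
class AccumulatingBytesIO:
    """Illustrative sketch: buffer writes in a list of fragments and
    only materialize a single bytes object when getvalue() is called,
    mirroring the strategy described for the StringIO C implementation."""

    def __init__(self):
        self._fragments = []

    def write(self, b):
        # Appending to a list is cheap; no buffer resize or copy happens here.
        self._fragments.append(bytes(b))
        return len(b)

    def getvalue(self):
        if not self._fragments:
            return b''
        # Collapse all fragments once. b''.join allocates the final
        # buffer in a single step because the total length is known.
        if len(self._fragments) > 1:
            self._fragments = [b''.join(self._fragments)]
        return self._fragments[0]
```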
On Tue, 17 Jul 2012 06:34:14 +0300, Eli Bendersky wrote:
> Is there any reason for this to be so? What does BytesIO give us that the second approach does not (I tried adding more methods to the patched RawIOBase to make it more functional, like seekable() and tell(), and it doesn't affect performance)?
Well, try implementing non-trivial methods such as readline() or seek(), and writing in the middle rather than at the end.

As Nick said, we could implement the same optimization as in StringIO, i.e. only materialize the buffer when necessary.

Regards,

Antoine.

--
Software development and contracting: http://pro.pitrou.net
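Antoine's point about seek() and mid-buffer writes can be illustrated by extending the accumulator idea: as soon as a write may land in the middle of the data, the fragment list has to be materialized into a contiguous buffer (a hypothetical sketch; all names are invented here):

```python
class SeekableAccumulator:
    """Hypothetical sketch: append-only writes go to a fragment list;
    the first seek() materializes a contiguous mutable buffer, since a
    later write may land in the middle, giving up the cheap-append path."""

    def __init__(self):
        self._fragments = []
        self._buffer = None   # bytearray, once materialized
        self._pos = 0

    def write(self, b):
        if self._buffer is None:
            # Fast path: append-only, no copying.
            self._fragments.append(bytes(b))
        else:
            # Slow path: overwrite (and possibly extend) in place.
            end = self._pos + len(b)
            if end > len(self._buffer):
                self._buffer.extend(b'\0' * (end - len(self._buffer)))
            self._buffer[self._pos:end] = b
            self._pos = end
        return len(b)

    def seek(self, pos):
        # From here on, writes may land mid-buffer: materialize.
        if self._buffer is None:
            self._buffer = bytearray(b''.join(self._fragments))
            self._fragments = []
        self._pos = pos

    def getvalue(self):
        if self._buffer is not None:
            return bytes(self._buffer)
        return b''.join(self._fragments)
```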
On 17.07.12 06:34, Eli Bendersky wrote:
> The second approach is consistently 10-20% faster than the first one (depending on input) for trunk Python 3.3.
>
> Is there any reason for this to be so? What does BytesIO give us that the second approach does not (I tried adding more methods to the patched RawIOBase to make it more functional, like seekable() and tell(), and it doesn't affect performance)?
BytesIO resizes its underlying buffer when it overflows, overallocating by 1/8 of the size and copying the old contents into the new buffer. In total it makes about log[9/8](N) allocations and copies about 8*N bytes (for large N). A list uses the same growth strategy, but the number of chunks is usually much smaller than the number of bytes. At the end, all the chunks are concatenated by join(), which sums the chunk lengths and allocates the resulting array at exactly the desired size. That is why append/join is faster than BytesIO in this case.

There is another note, about ElementTree.tostringlist(). Creating the DataStream class on every function call is too expensive, which is why the "monkeypatched" version is several times faster than the DataStream version for small data. It is faster for long data too, because each data.append() call costs one attribute lookup more than the pre-bound "monkeypatched" write = data.append.
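Serhiy's accounting can be checked with a rough simulation of a 1/8-overallocation growth policy (a sketch of the analysis, not the exact CPython resize code): the reallocation count tracks log[9/8](N), and the total bytes copied stay within a constant multiple of N.

```python
import math

def resize_cost(n, factor=1.125):
    """Simulate a buffer that grows by ~1/8 each time it overflows,
    counting reallocations and total bytes copied along the way."""
    capacity = 1
    reallocs = 0
    copied = 0
    while capacity < n:
        copied += capacity                 # old contents copied over
        capacity = int(capacity * factor) + 1
        reallocs += 1
    return reallocs, copied

n = 10**6
reallocs, copied = resize_cost(n)
# Realloc count is on the order of log[9/8](N); total copying is a
# geometric series summing to a small constant multiple of N.
print(reallocs, math.log(n, 9 / 8))
print(copied / n)
```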
> This also raises a "moral" question - should I be using the second approach deep inside the stdlib (ET.tostring) just because it's faster?
Please note that the previous version of Python used this "monkeypatching" approach.
On Tue, Jul 17, 2012 at 2:59 PM, Serhiy Storchaka wrote:
> On 17.07.12 06:34, Eli Bendersky wrote:
>> The second approach is consistently 10-20% faster than the first one (depending on input) for trunk Python 3.3.
>>
>> Is there any reason for this to be so? What does BytesIO give us that the second approach does not (I tried adding more methods to the patched RawIOBase to make it more functional, like seekable() and tell(), and it doesn't affect performance)?
>
> BytesIO resizes its underlying buffer when it overflows, overallocating by 1/8 of the size and copying the old contents into the new buffer. In total it makes about log[9/8](N) allocations and copies about 8*N bytes (for large N). A list uses the same growth strategy, but the number of chunks is usually much smaller than the number of bytes. At the end, all the chunks are concatenated by join(), which sums the chunk lengths and allocates the resulting array at exactly the desired size. That is why append/join is faster than BytesIO in this case.
I've created http://bugs.python.org/issue15381 to track this (optimizing BytesIO).
> There is another note, about ElementTree.tostringlist(). Creating the DataStream class on every function call is too expensive, which is why the "monkeypatched" version is several times faster than the DataStream version for small data. It is faster for long data too, because each data.append() call costs one attribute lookup more than the pre-bound "monkeypatched" write = data.append.
I updated tostringlist() to use a class defined outside the function. This brings performance back in line with the old code.

Eli
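The change Eli describes, hoisting the writer class out of the function so it is defined once at module level, can be sketched like this (a hypothetical simplified stand-in, not the actual ElementTree code):

```python
import io

# Defining the writer class once at module level avoids re-executing
# the class statement on every serialization call.
class _ListWriter(io.RawIOBase):
    def __init__(self, parts):
        super().__init__()
        self._parts = parts

    def writable(self):
        return True

    def write(self, b):
        self._parts.append(bytes(b))
        return len(b)

def tostring_like(chunks):
    # Hypothetical stand-in for an ET.tostring-style serializer:
    # accumulate chunks in a list, join once at the end.
    parts = []
    writer = _ListWriter(parts)
    for chunk in chunks:
        writer.write(chunk)
    return b''.join(parts)
```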
participants (5)
- Antoine Pitrou
- Eli Bendersky
- John O'Connor
- Nick Coghlan
- Serhiy Storchaka