[Python-Dev] cStringIO vs io.BytesIO
kmike84 at gmail.com
Wed Jul 16 23:44:23 CEST 2014
cStringIO was removed from Python 3. It seems the suggested replacement is
io.BytesIO. But there is a problem: cStringIO.StringIO(b'data') didn't copy
the data while io.BytesIO(b'data') makes a copy (even if the data is not
modified).
This means io.BytesIO is not well suited to cases where you want to get a
readonly file-like interface for existing byte strings. Isn't it one of the
main io.BytesIO use cases? Wrapping bytes in cStringIO.StringIO used to be
almost free, but this is not true for io.BytesIO.
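One rough way to check this on a given interpreter is to measure how much
memory is allocated while wrapping an existing bytes object. This is only a
sketch: the helper name bytes_allocated_by_wrapping is made up for
illustration, tracemalloc is only available in Python 3.4+, and the result
depends on the Python version and implementation. A value close to
len(payload) suggests a copy was made; a small value suggests the buffer is
shared.

```python
import io
import tracemalloc


def bytes_allocated_by_wrapping(data):
    """Return the net bytes allocated while constructing io.BytesIO(data).

    Hypothetical helper for illustration; not part of the stdlib.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    stream = io.BytesIO(data)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # Sanity check: the stream exposes the original data unchanged.
    assert stream.read(4) == data[:4]
    return after - before


payload = b"x" * (10 * 1024 * 1024)  # 10 MiB of already existing data
extra = bytes_allocated_by_wrapping(payload)
print("extra bytes allocated by io.BytesIO(payload):", extra)
```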
So making code 3.x compatible by ditching cStringIO can cause serious
performance/memory regressions. One can change the code to build the data
using BytesIO (without creating bytes objects in the first place), but that
is not always possible or convenient.
I believe this problem affects tornado (
https://github.com/tornadoweb/tornado/issues/1110), Scrapy (this is how I
became aware of this issue), NLTK (anecdotal evidence: I tried to port
a hairy NLTK module to io.BytesIO and it became many times slower), and
maybe pretty much every IO-related project ported to Python 3.x (django,
werkzeug and frameworks based on it, and requests all wrap user data in
BytesIO, and this may cause slowdowns and up to 2x memory usage in
Python 3.x).
Do you know if there is a workaround? Maybe there is some stdlib part that I'm
missing, or a module on PyPI? It is not that hard to write your own wrapper
that won't do copies (or to port [c]StringIO to 3.x), but I wonder if there
is an existing solution or plans to fix it in Python itself - this BytesIO
use case looks quite important.
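To make the "own wrapper" idea concrete, here is a minimal sketch of a
read-only stream that keeps a memoryview of the existing bytes instead of
copying them. The class name BytesReader is hypothetical, and this is just
one possible approach, not an existing stdlib or PyPI solution. Wrapping it
in io.BufferedReader provides readline() and friends; the buffered layer
still copies small chunks as they are read, but the whole payload is never
duplicated up front.

```python
import io


class BytesReader(io.RawIOBase):
    """Hypothetical read-only raw stream over an existing bytes-like object.

    Keeps a memoryview of the data rather than a copy of it.
    """

    def __init__(self, data):
        super().__init__()
        self._view = memoryview(data).cast("B")
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, buf):
        # Copy only the requested chunk into the caller's buffer.
        chunk = self._view[self._pos:self._pos + len(buf)]
        n = len(chunk)
        buf[:n] = chunk
        self._pos += n
        return n

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = len(self._view) + offset
        else:
            raise ValueError("invalid whence value")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def tell(self):
        return self._pos


data = b"first line\nsecond line\n"
f = io.BufferedReader(BytesReader(data))
print(f.readline())
```

Writes are not supported here, which matches the read-only use case
described above.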