[pypy-dev] Explicitly defining a string buffer object (aka StringIO += operator)

Paul Sokolovsky pmiscml at gmail.com
Sun Mar 29 13:52:30 EDT 2020


Hello,

1. Intro
--------

It is a well-known anti-pattern to use a string as a string buffer, to
construct a long (perhaps very long) string piece-wise. A running
example is:

buf = ""
for i in range(50000):
    buf += "foo"
print(buf)

An alternative is to use a buffer-like object explicitly designed for
incremental updates, which for Python is io.StringIO:

import io

buf = io.StringIO()
for i in range(50000):
    buf.write("foo")
print(buf.getvalue())

As can be seen, this requires changing the way the buffer is constructed
(usually in one place) and the way the buffer value is taken (usually in
one place). More importantly, it requires changing each line which adds
content to the buffer, and there can be many of those in more complex
algorithms. The result is code less clear than the original, noise-like
changes, and more complicated updates to 3rd-party code which needs
optimization.

To address this, this RFC proposes to add an __iadd__ method (i.e.
an implementation of the "+=" operator) to io.StringIO and io.BytesIO
objects, making it an exact alias of the .write() method. This will
allow for code closely parallel to the original str-using code:

buf = io.StringIO()
for i in range(50000):
    buf += "foo"
print(buf.getvalue())

This will still require updates for buffer construction/getting value,
but that's usually 2 lines. It will leave the rest of the code intact,
and not obfuscate the original content construction algorithm.
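
For reference, here is a minimal sketch of what the "+=" form does on a
stock io.StringIO today. It assumes a current CPython 3, where
io.StringIO defines neither __add__ nor __iadd__; the helper name
try_iadd is made up for this illustration.

```python
import io

def try_iadd():
    """Return the exception type raised by `buf += "foo"`, or None."""
    buf = io.StringIO()
    try:
        # Without __iadd__ (or __add__) on StringIO this is expected to fail.
        buf += "foo"
    except TypeError as exc:
        return type(exc)
    return None

print(try_iadd())
```

So the str-like spelling is not available out of the box, which is the
gap this RFC is about.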

2. Performance Discussion
-------------------------

The motivation for this change (promoting usage of io.StringIO by
making it look & feel more like str) is performance. But is it really a
problem? It turns out to be such a pervasive anti-pattern that recent
versions of CPython3 have a special optimization for it. Let's use the
following script for testing:

---------
import timeit
import io

def string():
    sb = u""
    for i in range(50000):
        sb += u"a"

def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"a")

print(timeit.timeit(string, number=10))
print(timeit.timeit(strio, number=10))
---------

With CPython3.6 the result is:

$ python3.6 str_iadd-vs-StringIO_write.py 
0.03350826998939738
0.033480543992482126

In other words, there's no difference between usage of str vs StringIO.
But it wasn't always like that. With CPython2.7.17:

$ python2.7 str_iadd-vs-StringIO_write.py 
2.10510993004
0.0399420261383

But Python2 is dead, right? Ok, let's see how Jython3 and IronPython3
fare. To my surprise, there are no (public releases of) such. Both
projects sit firmly in the Python2 territory. So, let's try them:

$ java -jar jython-standalone-2.7.2.jar str_iadd-vs-StringIO_write.py 
10.8869998455
1.74700021744

Sadly, I wasn't able to get IronPython.2.7.9.zip to run on my Linux
system, so I used the online version at https://tio.run/#python2-iron
(after discovering that https://ironpython.net/try/ is dead).

2.7.9 (IronPython 2.7.9 (2.7.9.0) on Mono 4.0.30319.42000 (64-bit))
26.2704391479
1.55628967285

So, it seems that rumors of Python2 being dead are somewhat exaggerated.
Let's try a project which tries to provide "missing migration path"
between Python2 and Python3 - https://github.com/naftaliharris/tauthon

Tauthon 2.8.1+ (heads/master:7da5b76f5b, Mar 29 2020, 18:05:05)
$ tauthon str_iadd-vs-StringIO_write.py 
0.792158126831
0.0467159748077

Whoa, tauthon seems to be faithful to its promise of being half-way
between CPython2 and CPython3.

Anyway, let's get back to Python3. Fortunately, there's PyPy3, so let's
try that:

$ ./pypy3.6-v7.3.0-linux64/bin/pypy3 str_iadd-vs-StringIO_write.py
0.5423258490045555
0.01754526497097686

Let's not forget little Python brothers and continue with
MicroPython 1.12 (https://github.com/micropython/micropython):

$ micropython str_iadd-vs-StringIO_write.py 
41.63419413566589
0.08073711395263672

Pycopy 3.0.6 (https://github.com/pfalcon/pycopy):

$ pycopy str_iadd-vs-StringIO_write.py 
25.03198313713074
0.0713810920715332

I also wanted to include TinyPy (http://tinypy.org/) and Snek 
(https://github.com/keith-packard/snek) in the shootout, but both
(seem to) lack StringIO object.


These results can be summarized as follows: of more than half a dozen
Python implementations, CPython3 is the only implementation which
optimizes for the dubious usage of an immutable string type as an
accumulating character buffer. For all other implementations, unintended
usage of str incurs overhead of about one order of magnitude, or 2 orders
of magnitude for implementations optimized for particular usecases
(this includes PyPy optimized for speed vs MicroPython/Pycopy optimized
for small code size and memory usage).
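
To make the "orders of magnitude" statement easier to check, the sketch
below computes the slowdown of str "+=" relative to StringIO.write()
from the timings quoted above. The numbers are copied verbatim from the
runs in this message; the dictionary layout and the names timings and
ratios are just an illustration.

```python
# Timings in seconds, copied from the runs above: (str +=, StringIO.write)
timings = {
    "CPython 3.6": (0.03350826998939738, 0.033480543992482126),
    "CPython 2.7.17": (2.10510993004, 0.0399420261383),
    "Jython 2.7.2": (10.8869998455, 1.74700021744),
    "IronPython 2.7.9": (26.2704391479, 1.55628967285),
    "Tauthon 2.8.1+": (0.792158126831, 0.0467159748077),
    "PyPy3.6 v7.3.0": (0.5423258490045555, 0.01754526497097686),
    "MicroPython 1.12": (41.63419413566589, 0.08073711395263672),
    "Pycopy 3.0.6": (25.03198313713074, 0.0713810920715332),
}

# Slowdown factor of the str-based loop relative to the StringIO loop.
ratios = {name: s / w for name, (s, w) in timings.items()}

# Print from the smallest slowdown to the largest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print("%-18s %8.1fx" % (name, ratio))
```

Note that these are single runs on one machine (and one online service),
so the ratios are indicative rather than precise.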

Consequently, other implementations have 2 choices:

1. Succumb to applying the same mis-optimization for string type as
CPython3. (With the understanding that for speed-optimized projects,
implementing mis-optimizations will eat into performance budget, and
for memory-optimized projects, it likely will lead to noticeable
memory bloat.)
2. Struggle against inefficient-by-concept usage, and promote usage of
the correct object types for incremental construction of string content.
This would require improving the ergonomics of the existing string buffer
object, to make its usage less painful for both writing new code and
refactoring existing code.

As you may imagine, the purpose of this RFC is to raise awareness and
try to make headway with choice 2.

3. Scope Creep, aka "Possible Future Work"
------------------------------------------

The purpose of this RFC is specifically to propose applying a *single*
simple, obvious change. .__iadd__ is just an alias for .write, period.

However, for completeness, it makes sense to consider both alternatives
and where the path of adding "str-like functionality" may lead us.

1. One alternative to patching StringIO would be to introduce a
completely different type, e.g. StringBuf. But that would largely be
"creating more entities without necessity", given that StringIO
already offers the needed buffering functionality, and just needs a
little touch of polish to its interface. If 2 classes like StringIO and
StringBuf existed, it would be an extra quiz to explain the difference
between them and why they both exist.

2. On the other hand, this RFC fixates on the output buffering. But
just imagine how much fun can be had re: input buffers! E.g., we can
define "buf[0]" to have semantics of "tmp = buf.tell(); res =
buf.read(1); buf.seek(tmp); return res". Ditto for .startswith(), etc.,
etc. So, well... This RFC is about making .__iadd__ an alias for
.write, and covering the output buffering usecase with this. Whoever may
have interest in dealing with input buffer shortcuts would need to
provide a separate RFC, with separate usecases and argumentation.
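
Purely to illustrate the kind of input-buffer shortcut meant in item 2
(and explicitly not proposed here), the tell/read/seek semantics could
be sketched with a user subclass. The class name PeekableStringIO and
the helper _peek are hypothetical, and only index 0 is handled.

```python
import io

class PeekableStringIO(io.StringIO):
    """Hypothetical subclass; not part of this RFC or of the io module."""

    def _peek(self, n):
        # Read n characters, then restore the stream position.
        pos = self.tell()
        try:
            return self.read(n)
        finally:
            self.seek(pos)

    def __getitem__(self, index):
        # Only buf[0], i.e. "next character without consuming it".
        if index != 0:
            raise IndexError("only buf[0] is sketched here")
        ch = self._peek(1)
        if not ch:
            raise IndexError("buffer exhausted")
        return ch

    def startswith(self, prefix):
        # Compare the upcoming characters against prefix, consuming nothing.
        return self._peek(len(prefix)) == prefix

buf = PeekableStringIO("hello world")
buf.read(6)  # consume "hello "
print(buf[0], buf.startswith("wor"), buf.tell())
```

Even this small sketch raises questions (other indexes, slices, end of
buffer), which is why it belongs in a separate RFC.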

4. (Self-)Criticism and Risks.
------------------------------

1. The biggest "criticism" I see is a response a-la "there's no problem
with CPython3, so there's nothing to fix". This is related to the bigger
question of "whether a life outside CPython exists", or put more
formally, where's the border between Python-the-language and
CPython-the-implementation. To address this point, I tried to collect
performance stats for a pretty wide array of Python implementations.

2. Another potential criticism is that this may open the door to scope
creep, adding more str-like functionality to classes which classically
expose a stream interface. Paragraph 3.2 is dedicated specifically to
addressing this point, by invoking what is hopefully the best-practice
approach: a request to focus on the currently proposed feature, which
requires very few changes for arguably noticeable improvements. (At the
same time, should this change be received positively, that may
influence further proposals from interested parties.)

3. Implementing the required functionality is pretty easy with a user
subclass:

class MyStringBuf(io.StringIO):
    def __iadd__(self, s):
        self.write(s)
        return self

Voila. The problem is performance. Calling such an .__iadd__() method
implemented in Python is 3 times slower than calling .write()
directly (with CPython3.6). But the paradigmatic problem is even bigger:
this RFC seeks to establish the best practice of using a type explicitly
designed for the purpose, with an ergonomic interface. Saying that "we
lack such a clearly designated type out of the box, but if you figure out
that it's a problem (if you do figure that out), you can easily resolve
that on your side, albeit with a performance hit when compared with it
being provided out of the box" - that's not really welcoming or
ergonomic.
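
A minimal sketch of how the subclass overhead can be measured, modeled
on the test script from section 2. The function names via_write and
via_iadd and the parameter n are made up for this illustration, and the
measured ratio will vary with implementation and version.

```python
import io
import timeit

class MyStringBuf(io.StringIO):
    def __iadd__(self, s):
        self.write(s)
        return self  # "+=" rebinds the name to whatever __iadd__ returns

def via_write(n=50000):
    # Baseline: direct calls to the built-in .write() method.
    sb = io.StringIO()
    for i in range(n):
        sb.write("a")
    return sb.getvalue()

def via_iadd(n=50000):
    # Same loop through the Python-level __iadd__ wrapper.
    sb = MyStringBuf()
    for i in range(n):
        sb += "a"
    return sb.getvalue()

if __name__ == "__main__":
    print(timeit.timeit(via_write, number=10))
    print(timeit.timeit(via_iadd, number=10))
```

Both functions build the same string; only the cost of the extra
Python-level method call differs.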

5. Prior Art
------------

As with many things related to Python, the idea is not new. I found a
thread from 2006 dedicated to it:
https://mail.python.org/pipermail/python-list/2006-January/403480.html
(strangely, there's some mixup in archives, and "thread view" shows
another message as thread starter, though it's not:
https://mail.python.org/pipermail/python-list/2006-January/357453.html).

The discussion there seems to have ended without clear resolution, and
was swamped by discussion of unicode handling complexities in
Python2 and implementation details of "StringIO" vs "cStringIO" modules
(both of which were deprecated in favor of "io").


-- 
Best regards,
 Paul                          mailto:pmiscml at gmail.com

