
Hello,

1. Intro
--------

It is a well-known anti-pattern to use a string as a string buffer, to
construct a long (perhaps very long) string piece-wise. A running
example is:

    buf = ""
    for i in range(50000):
        buf += "foo"
    print(buf)

An alternative is to use a buffer-like object explicitly designed for
incremental updates, which for Python is io.StringIO:

    import io

    buf = io.StringIO()
    for i in range(50000):
        buf.write("foo")
    print(buf.getvalue())

As can be seen, this requires changing the way the buffer is
constructed (usually in one place) and the way the buffer value is
taken (usually in one place). More importantly, it requires changing
each line which adds content to the buffer. There can be many of those
in more complex algorithms, which leads to code less clear than the
original, requires noise-like changes, and complicates updates to
3rd-party code which needs optimization.

To address this, this RFC proposes to add an __iadd__ method (i.e. an
implementation of the "+=" operator) to io.StringIO and io.BytesIO
objects, making it an exact alias of the .write() method. This will
allow for code closely parallel to the original str-using code:

    import io

    buf = io.StringIO()
    for i in range(50000):
        buf += "foo"  # proposed: same effect as buf.write("foo")
    print(buf.getvalue())

This will still require updates for buffer construction and for getting
the value, but that's usually 2 lines. It will leave the rest of the
code intact, and will not obfuscate the original content construction
algorithm.

2. Performance Discussion
-------------------------

The motivation for this change (promoting usage of io.StringIO by
making it look and feel more like str) is performance. But is it really
a problem? It turns out to be such a pervasive anti-pattern that recent
versions of CPython3 have a special optimization for it.
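To give a feel for how narrow that optimization is, here is a minimal
sketch. It rests on an assumption about CPython internals that is not
stated above: the string can be extended in place only while the name on
the left of "+=" holds the sole reference to it. The function names are
made up for illustration, and the timings are only meaningful on
CPython.

```python
import timeit


def concat_single_ref(n=20000):
    # buf is the only reference to the string, so CPython may be able
    # to resize it in place instead of copying it.
    buf = ""
    for _ in range(n):
        buf += "a"
    return buf


def concat_extra_ref(n=20000):
    # alias keeps a second reference to the old value alive during "+=".
    # Assumption: this prevents the in-place resize and forces a copy.
    buf = ""
    for _ in range(n):
        alias = buf
        buf += "a"
    return buf


if __name__ == "__main__":
    print(timeit.timeit(concat_single_ref, number=3))
    print(timeit.timeit(concat_extra_ref, number=3))
```

Both functions build the same string. Only the timing is expected to
differ, and only on an implementation which has this kind of
optimization.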
Let's use the following script for testing:

---------
import timeit
import io


def string():
    sb = u""
    for i in range(50000):
        sb += u"a"


def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"a")


print(timeit.timeit(string, number=10))
print(timeit.timeit(strio, number=10))
---------

With CPython3.6 the result is:

$ python3.6 str_iadd-vs-StringIO_write.py
0.03350826998939738
0.033480543992482126

In other words, there's no difference between usage of str and
StringIO. But it wasn't always like that. With CPython2.7.17:

$ python2.7 str_iadd-vs-StringIO_write.py
2.10510993004
0.0399420261383

But Python2 is dead, right? Ok, let's see how Jython3 and IronPython3
fare. To my surprise, there are no (public releases of) such. Both
projects sit firmly in Python2 territory. So, let's try them:

$ java -jar jython-standalone-2.7.2.jar str_iadd-vs-StringIO_write.py
10.8869998455
1.74700021744

Sadly, I wasn't able to get IronPython.2.7.9.zip to run on my Linux
system, so I used the online version at https://tio.run/#python2-iron
(after discovering that https://ironpython.net/try/ is dead).

2.7.9 (IronPython 2.7.9 (2.7.9.0) on Mono 4.0.30319.42000 (64-bit))
26.2704391479
1.55628967285

So, it seems that rumors of Python2 being dead are somewhat
exaggerated. Let's try a project which tries to provide the "missing
migration path" between Python2 and Python3 -
https://github.com/naftaliharris/tauthon

Tauthon 2.8.1+ (heads/master:7da5b76f5b, Mar 29 2020, 18:05:05)
$ tauthon str_iadd-vs-StringIO_write.py
0.792158126831
0.0467159748077

Whoa, tauthon seems to be faithful to its promise of being half-way
between CPython2 and CPython3. Anyway, let's get back to Python3.
Fortunately, there's PyPy3, so let's try that:

$ ./pypy3.6-v7.3.0-linux64/bin/pypy3 str_iadd-vs-StringIO_write.py
0.5423258490045555
0.01754526497097686

Let's not forget the little Python brothers, and continue with
MicroPython 1.12 (https://github.com/micropython/micropython):

$ micropython str_iadd-vs-StringIO_write.py
41.63419413566589
0.08073711395263672

Pycopy 3.0.6 (https://github.com/pfalcon/pycopy):

$ pycopy str_iadd-vs-StringIO_write.py
25.03198313713074
0.0713810920715332

I also wanted to include TinyPy (http://tinypy.org/) and Snek
(https://github.com/keith-packard/snek) in the shootout, but both (seem
to) lack a StringIO object.

These results can be summarized as follows: of more than half a dozen
Python implementations, CPython3 is the only one which optimizes for
the dubious usage of an immutable string type as an accumulating
character buffer. For all other implementations, this unintended usage
of str incurs an overhead of about one order of magnitude, and two
orders of magnitude for implementations optimized for particular
usecases (this includes PyPy, optimized for speed, as well as
MicroPython/Pycopy, optimized for small code size and memory usage).

Consequently, other implementations have 2 choices:

1. Succumb to applying the same mis-optimization for the string type as
CPython3. (With the understanding that for speed-optimized projects,
implementing mis-optimizations will eat into the performance budget,
and for memory-optimized projects, it will likely lead to noticeable
memory bloat.)

2. Struggle against inefficient-by-concept usage, and promote usage of
the correct object types for incremental construction of string
content. This would require improving the ergonomics of the existing
string buffer object, to make its usage less painful both for writing
new code and for refactoring existing code.

As you may imagine, the purpose of this RFC is to raise awareness and
try to make headway with choice 2.

3. Scope Creep, aka "Possible Future Work"
------------------------------------------

The purpose of this RFC is specifically to propose a *single* simple,
obvious change: .__iadd__ is just an alias for .write, period.
However, for completeness, it makes sense to consider both the
alternatives and where the path of adding "str-like functionality" may
lead us.

1. One alternative to patching StringIO would be to introduce a
completely different type, e.g. StringBuf. But that would largely be
"creating more entities without necessity", given that StringIO already
offers the needed buffering functionality, and just needs a little
polish on its interface. If 2 classes like StringIO and StringBuf
existed, it would be an extra puzzle to explain the difference between
them and why they both exist.

2. On the other hand, this RFC fixates on output buffering. But just
imagine how much fun can be had with input buffers! E.g., we can
define "buf[0]" to have the semantics of "tmp = buf.tell(); res =
buf.read(1); buf.seek(tmp); return res". Ditto for .startswith(), etc.,
etc. So, well... This RFC is about making .__iadd__ an alias for
.write, and covering the output buffering usecase with it. Whoever may
have an interest in input buffer shortcuts would need to provide a
separate RFC, with separate usecases and argumentation.

4. (Self-)Criticism and Risks.
------------------------------

1. The biggest "criticism" I see is a response along the lines of
"there's no problem with CPython3, so there's nothing to fix". This is
related to the bigger question of "whether a life outside CPython
exists", or put more formally, where the border lies between
Python-the-language and CPython-the-implementation. To address this
point, I tried to collect performance stats for a pretty wide array of
Python implementations.

2. Another potential criticism is that this may open the door to scope
creep, adding more str-like functionality to classes which classically
expose a stream interface.
Paragraph 3.2 is dedicated specifically to addressing this point, by
invoking what is hopefully the best-practice approach: a request to
focus on the currently proposed feature, which requires very few
changes for arguably noticeable improvements. (At the same time, if
this change is found to be positive, that may influence further
proposals from interested parties.)

3. Implementing the required functionality is pretty easy with a user
subclass:

    import io

    class MyStringBuf(io.StringIO):

        def __iadd__(self, s):
            self.write(s)
            return self

Voila. The problem is performance. Calling such an .__iadd__() method
implemented in Python is 3 times slower than calling .write() directly
(with CPython3.6). But the paradigmatic problem is even bigger: this
RFC seeks to establish the best practice of using a type explicitly
designed for the purpose, with an ergonomic interface. Saying that "we
lack such a clearly designated type out of the box, but if you figure
out that it's a problem (if you do figure that out), you can easily
resolve that on your side, albeit with a performance hit when compared
with it being provided out of the box" - that's not really welcoming or
ergonomic.

5. Prior Art
------------

As with many things related to Python, the idea is not new. I found a
thread from 2006 dedicated to it:
https://mail.python.org/pipermail/python-list/2006-January/403480.html
(strangely, there's some mixup in the archives, and the "thread view"
shows another message as the thread starter, though it's not:
https://mail.python.org/pipermail/python-list/2006-January/357453.html).
The discussion there seemed to end without a clear resolution, and was
swamped by discussion of unicode handling complexities in Python2 and
implementation details of the "StringIO" vs "cStringIO" modules (both
of which were deprecated in favor of "io").

-- 
Best regards,
 Paul                          mailto:pmiscml@gmail.com