
On Mar 29, 2020, at 10:57, Paul Sokolovsky <pmiscml@gmail.com> wrote:
It is a well-known anti-pattern to use a string as a string buffer, to construct a long (perhaps very long) string piece-wise. A running example is:
buf = ""
for i in range(50000):
    buf += "foo"
print(buf)
An alternative is to use a buffer-like object explicitly designed for incremental updates, which for Python is io.StringIO:
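The StringIO version of the running example presumably looked something like this (a minimal sketch, reconstructed from the description):

```python
import io

# Build the string incrementally in an in-memory buffer...
buf = io.StringIO()
for i in range(50000):
    buf.write("foo")

# ...then take the final value in one call.
print(len(buf.getvalue()))
```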
It’s usually an even better alternative to just put the strings into a list (or to write a generator that yields them), and then pass that to the join method. This is recommended in the official Python FAQ. It’s usually about 40% faster than using StringIO or relying on the string-concat optimization in CPython, it’s efficient across all implementations of Python, and it’s obvious _why_ it’s efficient. It can sometimes take more memory, but the tradeoff is usually worth it.

This has been well known in the Python community for decades. People coming from C++ look for something like stringstream and find StringIO; people coming from Java look for something like StringBuilder and build their own version around StringIO; people who are comfortable with Python use str.join. So third-party libraries that don’t do that are likely either (a) not expecting large amounts of data (and therefore probably suboptimal in other areas), or (b) written by someone who doesn’t really get Python.

So what is StringIO for? For being a file object that lives in memory rather than representing a file. Its API is exactly the same as every other file object’s, because that’s the whole point of it.
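The join idiom described above, applied to the running example (either the explicit-list form or the generator form works):

```python
# Accumulate the pieces in a list, then join them in a single pass.
parts = []
for i in range(50000):
    parts.append("foo")
result = "".join(parts)

# Equivalently, feed a generator expression straight to join.
result2 = "".join("foo" for i in range(50000))

assert result == result2
```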
As can be seen, this requires changing the way the buffer is constructed (usually in one place) and the way its final value is taken (usually in one place). More importantly, it requires changing every line which adds content to the buffer, and there can be many of those in more complex algorithms. This leads to code less clear than the original, requires noise-like changes, and complicates updates to 3rd-party code which needs such optimization.
To address this, this RFC proposes to add an __iadd__ method (i.e. implementing the "+=" operator) to io.StringIO and io.BytesIO objects, making it an exact alias of the .write() method. This would allow code very parallel to the original str-using code:
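The proposal itself would change io.StringIO in C, but the proposed behavior can be sketched in pure Python with a hypothetical subclass (the name IAddStringIO is made up for illustration):

```python
import io

class IAddStringIO(io.StringIO):
    """Sketch of the proposal: make += an alias of .write()."""
    def __iadd__(self, s):
        self.write(s)
        return self  # += rebinds the name to this same object

# The loop now reads exactly like the original str-using code.
buf = IAddStringIO()
for i in range(50000):
    buf += "foo"
print(len(buf.getvalue()))
```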
So your goal is to allow people to use badly-written third-party libs designed around the string-concat antipattern, without fixing those libs, by feeding them StringIO objects when they expected str objects? This seems like a solution to a theoretical problem that might work for some instances of that problem. But do you have any actual examples of third-party libs that have this problem, and that (obviously) break if you give them StringIO objects, but would not break when passed a StringIO with __iadd__?
But it wasn't always like that, with CPython2.7.17:
$ python2.7 str_iadd-vs-StringIO_write.py
2.10510993004
0.0399420261383
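The actual str_iadd-vs-StringIO_write.py is not shown; a sketch of what such a comparison might look like, written for Python 3 with timeit (the two functions mirror the += loop and the StringIO.write loop from the example):

```python
import io
import timeit

def concat_str(n=50000):
    # The string-concat antipattern under test.
    buf = ""
    for i in range(n):
        buf += "foo"
    return buf

def concat_stringio(n=50000):
    # The explicit in-memory buffer alternative.
    buf = io.StringIO()
    for i in range(n):
        buf.write("foo")
    return buf.getvalue()

print(timeit.timeit(concat_str, number=10))
print(timeit.timeit(concat_stringio, number=10))
```

Both functions produce identical output; only the timings differ (and by how much depends heavily on the implementation, which is the point of the thread).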
But Python2 is dead, right?
Yes. Not as in “nobody will ever run it again”, but definitely as in “no new feature you add to Python will be backported”. Python 2.7 the language and CPython 2.7 the implementation have been feature-frozen for years now, and now they’re not even supported by the Python organization at all. So, trying to improve the behavior of Python 2.7 code by making a proposal for Python won’t get you anywhere.

Adding StringIO.__iadd__ to Python 3.10 will not help anyone using Python 2.7. In fact, even if you somehow convinced everyone to make the extraordinary decision to re-open Python 2.7 and make a new 2.7.18 release with this feature backported, it still wouldn’t help the vast majority of people using Python 2.7, because most people using Python 2.7 are using stable systems with stable versions that they don’t update for years. That’s why they’re still using 2.7 in the first place: because 2.7.16 is what comes with the Linux LTS they’ve settled on for deployment, or it’s what comes with the macOS version they use for their dev boxes, or Jython doesn’t have a 3.x version yet, or whatever. So a new feature in 2.7.18 wouldn’t get to them for years, if ever.

It’s also worth noting that the io module is very slow in most Python 2.x implementations. There’s a separate (and older) StringIO module, and for CPython an accelerated cStringIO, and you almost certainly want to use those, not io, here. (Except, of course, that what you really want to use is join anyway.)
Ok, let's see how Jython3 and IronPython3 fare. To my surprise, there are no public releases of either. Both projects sit firmly in Python2 territory.
The last IronPython release, 2.7.9, was in 2018. As the release notes for that version say, “With this release, we will shift the majority of work to IronPython3.” Of course IronPython3 isn’t ready for prime time yet, but it’s not because they’re still firmly in Python2 territory and still making major improvements to their 2.7 branch, it’s because it’s taking a long time to finish their 3.x branch (in part because they no longer have Microsoft and Unity throwing resources at the project). They’re not adding new features to 2.7 any more than CPython is. (They are working on a 2.7.10; but it’s just 2.7.9 with support for more .NET runtimes plus porting some security fixes from the last CPython 2.7 stdlib.) I don’t know the situation with Jython as well, but I believe it’s similar.
Consequently, other implementations have 2 choices:
1. Succumb to applying the same mis-optimization for the string type as CPython3. (With the understanding that for speed-optimized projects, implementing mis-optimizations will eat into the performance budget, and for memory-optimized projects, it will likely lead to noticeable memory bloat.)

2. Struggle against this inefficient-by-concept usage, and promote usage of the correct object types for incremental construction of string content. This would require improving the ergonomics of the existing string buffer object, to make its usage less painful both when writing new code and when refactoring existing code.
3. Recognize that Python and CPython have been promoting str.join for this problem for decades, and most performance-critical code is already doing that, and make sure that solution is efficient; and recognize that poorly-written code is uncommon but does exist, and may take a bit more work to optimize than a 1-line change, but that's acceptable, and not the responsibility of any alternate Python implementation to help with.