
On Mon, Mar 30, 2020 at 10:07:30AM -0700, Andrew Barnert via Python-ideas wrote:
> Why? What’s the benefit of building a mutable string around a virtual file object wrapped around a buffer (with all the extra complexities and performance costs that involves, like incremental Unicode encoding and decoding) instead of just building it around a buffer directly?
The quote about adding another abstraction layer solving every problem except the problem of having too many abstraction layers comes to mind.

But let's please not hijack this proposal by making it about a full-blown mutable string object. Paul's proposal is simple: add `+=` as an alias to `.write` to StringIO and BytesIO.

We have the str concat optimization to cater for people who want to concatenate strings using `buf += str`. You are absolutely right that the correct cross-platform way of doing it is to accumulate a list then join it, but that's an idiom that doesn't come easily to many people. Hence even people who know better sometimes prefer the `buf += str` idiom, and hence the repeated arguments about making join a list method. (But you must accumulate the list with append, not with list concatenation, or you are back to quadratic behaviour.)

It seems to me that the least invasive change to write efficient, good looking code is Paul's suggestion to use StringIO or BytesIO with the proposed `+=` operator. Side by side:

    # best read using a fixed-width font

    buf = ''              buf = []              buf = io.StringIO()
    for s in strings:     for s in strings:     for s in strings:
        buf += s              buf.append(s)         buf += s
                          buf = ''.join(buf)    buf = buf.getvalue()

Clearly the first is prettiest, which is why people use it. (It goes without saying that *pretty* is a matter of opinion.) It needs no extra conversion at the end, which is nice. But it's not cross-platform, and even in CPython it's a bit risky.

The middle is the most correct, but honestly, it's not that pretty. Many people *really* hate the fact that join is a string method and would rather write `buf.join('')`.

The third is, in my opinion, quite nice. With the status quo `buf.write(s)`, it's much less nice.

Paul's point about refactoring should be treated more seriously. If you have code that currently has a bunch of `buf += s` scattered around in many places, changing to the middle idiom is difficult:

1. you have to change the buffer initialisation;

2. you have to add an extra conversion to the end;

3. and you have to change every single `buf += s` to `buf.append(s)`.

With Paul's proposal, 1 and 2 still apply, but that's just two lines. Three if you include the `import io`. But step 3 is gone. You don't have to change any of the buffer concatenations to appends. Now that's not such a big deal when all of the concatenations are right there in one little loop, but if they are scattered around dozens of methods or functions it can be a significant refactoring step.
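For anyone who wants to try the third idiom today, here is a rough sketch of what the proposed behaviour might look like. This is not part of the stdlib: `AppendableStringIO` and `build` are hypothetical names made up for illustration, and the sketch simply assumes that `+=` should do exactly what `.write` does and hand back the same buffer.

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical StringIO subclass where `buf += s` means `buf.write(s)`."""

    def __iadd__(self, s):
        # Delegate to the existing write method; nothing else changes.
        self.write(s)
        # In-place operators must return the object to rebind the name to.
        return self


def build(strings):
    """Accumulate strings using the proposed `+=` spelling."""
    buf = AppendableStringIO()   # step 1: changed buffer initialisation
    for s in strings:
        buf += s                 # step 3 is gone: this line is untouched
    return buf.getvalue()        # step 2: extra conversion at the end


print(build(["spam", " ", "eggs"]))
```

Only the first and last lines of `build` differ from the plain `buf = ''` version, which is the refactoring point made above.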
> More generally, a StringIO is neither the obvious way
If I were new to Python, and wanted to build a string, and knew that repeated concatenation was slow, I'd probably look for some sort of String Builder or String IO class before thinking of *list append*. Especially if I came from a Java background.
> nor the fastest way
It's pretty close though. On my test, accumulating 500,000 strings into a list versus a StringIO buffer, then building a string, took 27.5 versus 31.6 ms. Using a string took 36.4 ms. So it's faster than the optimized string concat, and within arm's reach of list+join. Replacing buf.write with `+=` might, theoretically, shave off a bit of the overhead of attribute lookup. That would close the distance a fraction. And maybe there are other future optimizations that could follow. Or maybe not.
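A comparison along these lines can be reproduced with something like the sketch below. It is not the script that produced the timings above: the function names, the three-character test string and the use of `timeit.repeat` are all assumptions, and the numbers will vary by machine, Python version and implementation.

```python
import io
import timeit


def concat_str(strings):
    # Relies on the CPython str concat optimization; may be quadratic elsewhere.
    buf = ''
    for s in strings:
        buf += s
    return buf


def concat_list(strings):
    # The list-append-then-join idiom.
    buf = []
    for s in strings:
        buf.append(s)
    return ''.join(buf)


def concat_stringio(strings):
    # The status quo StringIO spelling, using write.
    buf = io.StringIO()
    for s in strings:
        buf.write(s)
    return buf.getvalue()


def bench(strings, repeat=5):
    """Return the best-of-`repeat` time in milliseconds for each idiom."""
    results = {}
    for func in (concat_list, concat_stringio, concat_str):
        times = timeit.repeat(lambda: func(strings), number=1, repeat=repeat)
        results[func.__name__] = min(times) * 1000
    return results


# Assumed workload: 500,000 short strings, as in the test described above.
for name, ms in bench(['abc'] * 500_000).items():
    print(f"{name}: {ms:.1f} ms")
```

Taking the minimum of several runs is a judgment call; it reduces noise from other processes but says nothing about worst-case behaviour.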
> nor the recommended way to build strings on the fly in Python, so why do you agree with the OP that we need to make it better for that purpose? Just to benefit people who want to write C++ instead of Python?
If writing `buf += s` is writing C++ instead of Python, then you have spent much of this thread defending the optimization added in version 2.4 to allow people to write C++ instead of Python. So why are you suddenly against it now when the underlying buffer changes from str to StringIO?

When I was younger and still smarting from being on the losing side of the Pascal vs C holy wars, I really hated the idea of adding `+=` to Python because it would encourage people to write C instead of Python. I got over it :-)

--
Steven