[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

March 30, 2020

On Mar 30, 2020, at 13:06, Paul Sokolovsky <pmiscml@gmail.com> wrote:
...
I appreciate expressing it all concisely and clearly. Then let me
respond here instead of the very first '"".join() rules!' reply I got.
Ignoring replies doesn’t actually answer them.
...
The issue with "".join() is very obvious:
------
import io
import sys

def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"==%d==" % i)
    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))

This doesn’t tell you anything useful. As the help for getsizeof makes clear, “Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to”. So this gives you some fixed value like 152, no matter how big the buffer and other internal objects may be.
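
To illustrate the shallow accounting with a type whose contents are easy to inspect, here is a minimal sketch using a plain list of strings. The name `chunks` and the sizes are made up for illustration; exact byte counts vary by Python version and platform.

```python
import sys

# A list holding 1,000 distinct 1,000-character strings (illustrative data).
chunks = [str(i).rjust(1000, "x") for i in range(1000)]

# getsizeof() on the list counts only the list object itself
# (header plus pointer array), not the strings it refers to.
shallow = sys.getsizeof(chunks)

# Adding the size of each element gives a rough "deep" figure.
deep = shallow + sum(sys.getsizeof(s) for s in chunks)

print("list only:        ", shallow)
print("list plus strings:", deep)
```

The first number stays small however long the strings are, which is the same reason a getsizeof() call on a StringIO says little about its internal buffers.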

If you’re using CPython with the C accelerator, none of those internals are available to you from the API, but a quick scan of the C source shows what’s there, and it generally takes more storage than the list version. Oversimplifying a bit: while you’re building, it keeps a _PyAccu structure, which is basically a wrapper around that same list of strings. When you call getvalue(), it then builds a Py_UCS4* representation that in this case is 4x the size of the final string (since your string is pure ASCII and will be stored in UCS1, not UCS4). And then there’s the final string.

So, if this memory issue makes join unacceptable, it makes your optimization even more unacceptable.
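
Rather than guessing, one way to compare the two approaches is to measure peak allocation with tracemalloc. The following is only a sketch: the helper names (`build_with_join`, `build_with_stringio`, `peak_bytes`) are invented here, tracemalloc sees only allocations made through Python's memory allocators, and the numbers depend on the implementation and version, so no particular result is claimed.

```python
import io
import tracemalloc

N = 50000  # same iteration count as the quoted benchmark


def build_with_join():
    # The usual idiom: collect pieces in a list, join once at the end.
    parts = []
    for i in range(N):
        parts.append("==%d==" % i)
    return "".join(parts)


def build_with_stringio():
    # The proposed alternative: write pieces into an in-memory text file.
    sb = io.StringIO()
    for i in range(N):
        sb.write("==%d==" % i)
    return sb.getvalue()


def peak_bytes(func):
    """Return (result, peak traced allocation in bytes) for one call."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


joined, join_peak = peak_bytes(build_with_join)
written, sio_peak = peak_bytes(build_with_stringio)
print("join peak:    ", join_peak)
print("StringIO peak:", sio_peak)
```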

And thinking about portable code makes it even worse. Your code might be run under CPython and take even more memory, or it might be run under a different Python implementation where StringIO is not accelerated (where it’s just a TextIOWrapper around a BytesIO) and therefore be a whole lot slower instead. So it has to be able to deal with both of those possibilities, not just one; code that uses the usual idiom, on the other hand, behaves pretty similarly on all implementations.
...
There's absolutely no reason why performing the trivial operation of
accumulating string content should take about an order of magnitude more
memory than actually needed for that string content. Don't get me wrong
- if you want to spend that much of your memory, then sure, you can. But
jumping with that as *the only right solution* whenever somebody
mentions "string concatenation" is a bit ... umm, cavalier.
And making a wild guess about how things might be implemented, offering an optimization based on that guess that actually makes things worse, and refusing to even reply when people point out the problems isn’t even more cavalier?
...
My whole concern is along 2 lines:
1. This StringBuilder class *could* be an existing io.StringIO.
2. By just adding __iadd__ operator to it.
No, it really couldn’t. The semantics are wrong (unless you want, say, universal newline handling in your string builder?), it’s optimized for a different use case than string building, and both the pure-Python and CPython accelerator implementations are less efficient in speed and/or memory.
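
A short sketch of the file semantics in question, using only the documented io.StringIO constructor arguments (the variable names are illustrative):

```python
import io

# Writes go to the current file position, which starts at 0 even when an
# initial value is given, so existing content is overwritten, not extended.
sb = io.StringIO("hello")
sb.write("XY")
overwritten = sb.getvalue()   # "XYllo"

# With newline=None, universal newline translation is applied to the data,
# so "\r\n" comes back as "\n".
nl = io.StringIO(newline=None)
nl.write("a\r\nb")
translated = nl.getvalue()    # "a\nb"

print(repr(overwritten), repr(translated))
```

Neither behavior is something you would expect from a += on a string builder, but both are exactly what an in-memory file object should do.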
...
That's it, nothing else. What's inside StringIO class is up to you (dear
various Python implementations, their maintainers, and contributors).
Sure, but what’s inside has to actually perform the job it was designed to do and is documented to do: to simulate a file object in memory. Which is not the same thing as being a string builder.
...
For example, fans of "".join() surely can have it inside. Actually,
it's a known fact that Python2's "StringIO" module (the original home
of StringIO class) was implemented exactly like that, so you can go
straight back to the future.
Python2’s StringIO module is for bytes, not Unicode strings. If you want a mutable bytes-like type, bytearray already exists; there’s no need to wrap the sequence in a file-like API only to rewrap that in a sequence-like API again. Just use the sequence directly. StringIO is there for when you _need_ the file API, just like Python 3’s io.BytesIO. It’s not a more efficient bytearray or one better suited for string building; it’s less efficient and less well suited for string building, but it adds different features.
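
As a minimal sketch of using the sequence directly, a bytearray already supports in-place extension with +=, with no file API involved (the loop and the format string are just illustrative):

```python
buf = bytearray()
for i in range(5):
    # += extends the bytearray in place; bytes %-formatting needs Python 3.5+.
    buf += b"==%d==" % i

# Take an immutable snapshot once building is finished.
result = bytes(buf)
print(result)
```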
...
And again, the need for anything like that might be unclear for
CPython-only users. Such users can write a StringBuilder class like
above, or repeat the beautiful "".join() trick over and over again. The
need for a nice string builder class may occur only from the
consideration of the Python-as-a-language lacking a clear and nice
abstraction for it, and from thinking how to add such an abstraction in
a performant way (for which the criteria differ) in as many
implementations as possible, in as easy a way as possible. (At least
that's my path to it, I'm not sure if a different thought process might
lead to it too.)
The problem isn’t your start, it’s jumping to the assumption that StringIO must be an answer, and then not checking the docs and the code to see if there are problems, and then ignoring the problems when they’re pointed out. Why do you think a virtual file object must be the optimal way to implement a string builder in the first place?
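
For reference, a list-backed builder of the kind the quoted text alludes to can be sketched in a few lines without involving a file object at all. This is a hypothetical illustration, not a standard-library class; the name `StringBuilder` and its methods are assumptions made here.

```python
class StringBuilder:
    """Hypothetical list-backed string accumulator (not a stdlib class)."""

    def __init__(self):
        self._parts = []

    def __iadd__(self, text):
        # Appending to a list is amortized O(1); the join happens only once.
        self._parts.append(text)
        return self

    def __str__(self):
        return "".join(self._parts)


sb = StringBuilder()
for i in range(3):
    sb += "==%d==" % i
built = str(sb)
print(built)
```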

Andrew Barnert