[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

April 12, 2020

      ...
On Apr 11, 2020, at 06:52, Paul Sokolovsky <pmiscml@gmail.com> wrote:
...
> > And a StringBuilder class would be another way.
>
> StringBuilder would be just subset of functionality of StringIO, and
> would be hard to survive Occam's razor. (Mine is sharp to cut it off
> right away.)

I think _this_ is actually the root of the disagreement.

A StringBuilder that does one thing and does it well survives Occam’s razor in lots of other languages, like Java. Why? That one thing could be done by a mutable string object, or by a string stream object, so why not just pile it into one of those instead? Because piling it into one of those means you run into conflicting requirements, which force you to make hard tradeoffs: possibly tradeoffs that are bad for other code, and possibly ones that break assumptions that existing code has relied on for years.

Python’s StringIO is readable as well as writable. (If I have a library that wants a file object, and I have the data in memory, I just wrap it in a StringIO and now I have that file object. People use it for this all the time.) It also has a current position pointer, and can seek back to previously marked locations. It has optional newline conversion. It has all the behavior that a file object has to have, and code relies on that fact, and that forces design decisions on you that may not be optimal for a StringBuilder.
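
As a rough illustration of the file-object behavior described above (the sample data and variable names are made up for this sketch), here is a StringIO being handed to csv.reader, seeking back to a marked position, and translating newlines on write:

```python
import csv
import io

# Wrap in-memory text so a library that expects a file object can read it.
data = io.StringIO("name,qty\nspam,2\neggs,12\n")
rows = list(csv.reader(data))

# The stream has a current position: mark a spot, read on, then seek back.
text = io.StringIO("first line\nsecond line\n")
first = text.readline()
mark = text.tell()
rest = text.read()
text.seek(mark)
rest_again = text.read()

# Optional newline conversion: "\n" is translated to "\r\n" on write.
crlf = io.StringIO(newline="\r\n")
crlf.write("a\nb\n")
converted = crlf.getvalue()
```

None of this is needed by something that only ever appends and then hands back a string.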

It sounds like you already know the issues with mutable strings, so I won’t go over them here.

A stand-alone StringBuilder doesn’t have to do those things; it just has to append characters or strings to the end, and be able to give you a string when you’re done. So it can be optimal and at the same time dead simple. It can be nothing more than a dynamically-expanding array (or realloc buffer) of UCS4 characters. Or, if you want to (usually) trade a bit of time for a lot of space savings, it can be a union of a dynamically-expanding array of UCS1/2/4 characters (that has to reallocate and copy the first time you append an out-of-range character), but that’s still a whole lot simpler in a StringBuilder than in something that has to meet the str and PyUnicode APIs, or the file object APIs. Or you could design something more complicated if that turns out to work better. If any of these makes it hard to implement persistent seek positions that work even after you’ve reallocated, wastes overflow space when you’re using it just to read from an immutable input, etc., that would be completely irrelevant, because, unlike StringIO, nobody can ask a StringBuilder to do any of those things, so your design doesn’t have to support them.
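
To make the "dead simple" version concrete, here is a minimal sketch. StringBuilder, append, and build are hypothetical names (no such class exists in the stdlib), a bytearray stands in for the realloc buffer, and fixed-width UTF-32 stands in for the array of UCS4 characters. It is an illustration of the idea, not a tuned implementation:

```python
class StringBuilder:
    """Hypothetical append-only builder; not a standard library class."""

    def __init__(self):
        # bytearray over-allocates as it grows, much like a realloc buffer.
        self._buf = bytearray()

    def append(self, text):
        # Store every character as 4 bytes; "surrogatepass" lets lone
        # surrogates (which str allows) round-trip unchanged.
        self._buf += text.encode("utf-32-le", "surrogatepass")
        return self

    def __iadd__(self, text):
        # Support `sb += "..."` by delegating to append().
        return self.append(text)

    def __len__(self):
        # Number of characters appended so far.
        return len(self._buf) // 4

    def build(self):
        # Produce the final immutable str.
        return self._buf.decode("utf-32-le", "surrogatepass")


sb = StringBuilder()
sb += "spam"
sb += " and "
sb.append("eggs \u00e9 \U0001F40D")
result = sb.build()
```

There is no position, no seeking, and no newline handling here, so nothing in the design has to account for them.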

Plus, looking beyond CPython, a new class can have whatever cross-implementation requirements we write into it. You can document that a StringBuilder doesn’t retain all of its input strings, but is at minimum roughly as efficient as making a list of strings and joining them anyway, and every Python implementation will do that (or just not implement the class at all, if they can’t, and document that fact, the reason why, and the recommended porting alternative very high up in a “differences from CPython” chapter), and any backport will too. You can’t document that about StringIO, because it would just be a lie for most existing implementations (including CPython 2.6-3.9, PyPy, etc.).
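
For reference, the "list of strings and join" baseline mentioned above looks roughly like this (render_lines and its input are made up for the example). Note that the list keeps every input string alive until the final join:

```python
def render_lines(items):
    # Collect the pieces in a list; each piece stays referenced until join().
    parts = []
    for number, item in enumerate(items, 1):
        parts.append(f"{number}. {item}\n")
    # One pass at the end builds the final string.
    return "".join(parts)


rendered = render_lines(["spam", "eggs"])
```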
...
> I see, it's whole different concept for you. But as I mentioned,
> they're the same concept for me - both stream and buffer *are*
> protocols. And that's based on my desire to define Python as a generic
> programming language, based on a few consistent and powerful concepts.

Sure, buffers and streams are protocols, but they’re not the same protocol. A buffer is all about random access; a stream is not.

And file is a protocol too. There are even ABCs for it. It’s also not the same protocol as the simpler thing you’re thinking of as stream, of course, but it’s certainly a protocol.

And Python already is a generic language in your sense; most code is written around protocols like file and buffer and iterable and mapping and even number. Pythonic code, whenever possible, doesn’t care if I feed it a shelve instead of a dict, or a np.array of float64 instead of a float, or a StringIO instead of a TextIOWrapper around a FileIO. And people rely on that fact all the time. And you usually don’t even have to do anything special to make that true for your libraries.
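
A small sketch of what that looks like for the file protocol. count_words is a made-up function, and a BytesIO stands in for the FileIO so the example does not touch the filesystem:

```python
import io


def count_words(stream):
    # Relies only on iterating over text lines, which any text file object supports.
    return sum(len(line.split()) for line in stream)


text = "one two\nthree\n"
in_memory = io.StringIO(text)
# Assumption for the sketch: BytesIO replaces the FileIO a real file would use.
wrapped = io.TextIOWrapper(io.BytesIO(text.encode("utf-8")), encoding="utf-8")

# Both objects satisfy the text-file ABC from the io module.
both_are_text_files = all(
    isinstance(f, io.TextIOBase) for f in (in_memory, wrapped)
)
counts = [count_words(in_memory), count_words(wrapped)]
```

The function never asks which concrete class it was given, and nothing special was done to make that work.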

Your real problem seems to be just that you wish Python were designed around a simpler stream protocol instead of the big and messy file protocol.

Maybe that would be better. File could be a subtype or wrapper, or maybe even a collection of them that could be composed as needed—you don’t always need seekability just because you need newline conversion, or vice versa. Java’s granular streams design is actually pretty handy at times (and I think it’s completely orthogonal to their horrible and verbose API around getting, building, and using streams). Then maybe OutputStringStream would just obviously be usable as a builder (which is almost, but not quite, true for C++). And there might be other benefits too. (We could also definitely have a cleaner API for things like socket.makefile, which today looks like a file but raises on many operations.)

But that’s not the language we have. And it still won’t be the language we have if you add an __iadd__ method to StringIO. Making StringIO not be a fully-featured and optimal-for-file-like-usage file object isn’t an option, because you can’t break all the code that depends on it. The only way to get there from here would be to design a completely new stream system and get the vast majority of the Python ecosystem to switch over to using it. Which is a pretty huge ask. (And it still won’t let you just add __iadd__ to StringIO; it’ll only let you add __iadd__ to that new OutputStringStream.)
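
One way to see why += does not turn a file object into a builder: in the sketch below, AppendableStringIO is a hypothetical subclass (assuming __iadd__ would simply delegate to write), and it still obeys the file position, so += after a seek overwrites rather than appends:

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical subclass; assumes += would just call write()."""

    def __iadd__(self, text):
        self.write(text)
        return self


s = AppendableStringIO()
s += "hello world"
s.seek(0)          # still a file object, so the position can move
s += "HELLO"       # writes at the current position, not at the end
value = s.getvalue()
```

Here value ends up as "HELLO world": the object is still a full file object with file semantics, just with one more spelling of write.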

Andrew Barnert