[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

March 30, 2020

      On Mar 30, 2020, at 08:29, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
...

> I agree with the arguments the OP brings forward.
> Maybe, it should be the case of having an `StringIO` and `BytesIO` subclass?
> Or better yet, just a class that wraps those, and hide away the other file-like
> methods and behaviors?

Why? What’s the benefit of building a mutable string around a virtual file object wrapped around a buffer (with all the extra complexities and performance costs that involves, like incremental Unicode encoding and decoding) instead of just building it around a buffer directly?

Also, how can you implement an efficient randomly-accessible mutable string object on top of a text file object? Text files don’t do constant-time random-access seek to character positions; they can only seek to the opaque tokens returned by tell. (This should be obvious if you think about how you could seek to the 137th character in a UTF-8 file without reading all of the characters before it.) In fact, recent versions of CPython optimize StringIO so it only fakes being a TextIOWrapper around a BytesIO and actually uses a Py_UCS4* buffer for storage, but that’s CPython-specific, not guaranteed, and not accessible from Python even in CPython.
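To make the UTF-8 point concrete, here is a small sketch (the helper name byte_offset_of_char is made up for illustration, and this is not how any io class is implemented). Finding where the 137th character starts means walking every byte before it, because characters take a varying number of bytes:

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset at which character `index` (0-based) starts.

    `data` is assumed to be valid UTF-8. The scan is linear in the offset:
    there is no way to jump straight to the answer.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


# Sample text mixing 1-, 2-, and 3-byte characters (escapes used for clarity).
text = "na\u00efve caf\u00e9 \u2615 " * 20
data = text.encode("utf-8")

offset = byte_offset_of_char(data, 136)  # the 137th character
assert offset != 136                     # byte offset and character index differ
assert data[offset:].decode("utf-8") == text[136:]
```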

And, even if that were a good idea for implementation reasons, why should the user care? If they need a mutable string, why would they care whether the one you give them inherits from or delegates to a StringIO, rather than being built on a list, or an array.array of int32, or the CPython string buffer API (whether accessed via a C extension or ctypes.pythonapi), or a pure C library with its own implementation and optimizations?
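As a rough illustration of the "built on a list" option, here is a minimal sketch. The class name MutableText and its small API are hypothetical, invented for this example; nothing like it exists in the stdlib. The storage choice is invisible to the caller, who only sees indexing, += and str():

```python
class MutableText:
    """Hypothetical mutable string stored as a list of one-character strings."""

    def __init__(self, initial=""):
        self._chars = list(initial)

    def __len__(self):
        return len(self._chars)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return "".join(self._chars[index])
        return self._chars[index]

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            # Slice assignment may grow or shrink the text.
            self._chars[index] = list(value)
        else:
            if len(value) != 1:
                raise ValueError("expected a single character")
            self._chars[index] = value  # constant-time replacement

    def __iadd__(self, other):
        self._chars.extend(str(other))  # amortized constant time per character
        return self

    def __str__(self):
        return "".join(self._chars)


buf = MutableText("hello world")
buf[0] = "J"
buf += "!"
buf[6:11] = "there"
print(str(buf))  # Jello there!
```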

More generally, a StringIO is neither the obvious way nor the fastest way nor the recommended way to build strings on the fly in Python, so why do you agree with the OP that we need to make it better for that purpose? Just to benefit people who want to write C++ instead of Python? If the goal is to cater to people who won’t read the docs to learn the right way, the obvious solution is to mandate the non-quadratic string concatenation of CPython for all implementations, not to give them yet another way of doing it and hope they’ll guess or look up that one even though they didn’t guess or look up the long-standing existing one.
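For reference, the long-standing idiom alluded to here is presumably the one from the Python FAQ: collect the pieces in a list and call str.join once at the end. A side-by-side sketch with StringIO (the function names are made up for illustration, and no timing claims are intended):

```python
import io


def build_with_join(pieces):
    """Collect pieces in a list, then join once at the end."""
    parts = []
    for piece in pieces:
        parts.append(piece)
    return "".join(parts)


def build_with_stringio(pieces):
    """Same result via an in-memory text file."""
    buffer = io.StringIO()
    for piece in pieces:
        buffer.write(piece)
    return buffer.getvalue()


pieces = [f"line {i}\n" for i in range(5)]
assert build_with_join(pieces) == build_with_stringio(pieces)
```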
...
> That would keep the new class semantically as a string,
> and they could implement all of the str/bytes methods and attributes
> so as to be a drop-in replacement

Sadly, this isn’t possible. Large amounts of C code, including builtins and the stdlib, won’t let you duck-type as a string: that code does a type check and expects an actual str (and if you subclass str, it ignores your methods and uses the PyUnicode APIs to get at your base class’s storage directly as a buffer instead). So no type, whether written in C or Python, can really be a drop-in replacement for str. At best you can have something that you have to call str() on half the time. That’s why there’s no MutableStr on PyPI, no UTF8Str, and no EncodedStr that can act as both a bytes and a str by remembering its encoding (Nick Coghlan’s motivating example for changing this back in the early 3.x days), etc.
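Both halves of that claim can be seen with str.join in CPython. The classes Shouty and DuckString below are hypothetical, written only for this demonstration:

```python
class Shouty(str):
    """A str subclass that tries to change how it is rendered."""

    def __str__(self):
        return "OVERRIDDEN"


class DuckString:
    """Not a str subclass, but it quacks like one at the Python level."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        return self._value

    def __len__(self):
        return len(self._value)

    def __getitem__(self, index):
        return self._value[index]


# str.join reads the subclass's underlying storage and never calls __str__.
joined = ", ".join([Shouty("a"), Shouty("b")])
assert joined == "a, b"

# An explicit str() call does go through the override.
assert str(Shouty("a")) == "OVERRIDDEN"

# A duck-typed object is rejected outright by the type check...
try:
    ", ".join([DuckString("a")])
except TypeError as exc:
    error = exc
else:
    error = None
assert isinstance(error, TypeError)

# ...so the caller has to convert by hand.
assert ", ".join([str(DuckString("a")), "b"]) == "a, b"
```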

Fixing this cleanly would probably require splitting the string C API into abstract and concrete versions, the way the sequence API is split, and then changing a ton of code to respect abstract strings (optimizing for concrete ones rather than requiring them, again like sequences). Fixing it slightly less cleanly with a hookable API might be more feasible (I’m pretty sure Nick Coghlan looked into it before the 3.3 string redesign; I don’t know if anyone has since), but it’s still probably a major change.

Andrew Barnert