
On Mon, Mar 30, 2020 at 01:59:42PM -0700, Andrew Barnert via Python-ideas wrote: [...]
> When you call getvalue() it then builds a Py_UCS4* representation that’s in this case 4x the size of the final string (since your string is pure ASCII and will be stored in UCS1, not UCS4). And then there’s the final string.
>
> So, if this memory issue makes join unacceptable, it makes your optimization even more unacceptable.
You seem to be talking about a transient spike in memory usage, as the UCS4 string is built and then disposed of. Paul seems to be talking about holding on to large numbers of substrings for long periods of time, possibly minutes or hours or even days in the case of a long-running process.

If StringIO.getvalue() builds an unnecessary UCS4 string, that's an obvious opportunity for optimization. Regardless of whether people use StringIO by calling the write() method or Paul's proposed `+=`, this optimization might still be useful.

In any case, throw one emoji into your buffer, just one, and the whole point becomes moot. Whether you are using StringIO or list.append plus join, you still end up with a UCS4 string at the end.

I don't understand the CPython implementation very well, and I barely know any C at all, but your argument seems a bit dubious to me. Regardless of the implementation, if you accumulate N code points, it takes a minimum of N times the width of a code point to store that buffer. With a StringIO buffer, there is at least the opportunity to keep them all in a single buffer with minimal overhead:

    buf --> [CCCC]  # four code points, each of 4 bytes in UCS4

With a list, you have significantly more overhead. For the sake of discussion, let's say you build it from four one-character strings:

    lst --> [PPPP]  # four pointers to str objects

Each pointer will take eight bytes on a modern 64-bit system, so that's already double the size of buf. Then there is the object overhead of the four strings, which is *particularly* acute for single ASCII chars: 50 bytes for a one-byte ASCII char. So in the worst case, every char you add to your buffer takes 58 bytes in a list versus 4 for a StringIO that uses UCS4 internally.

Whether StringIO takes advantage of that opportunity *right now* or not is, in a sense, irrelevant. It's an opportunity that lists don't have. Any (potential) inefficiency in StringIO could be improved, but it's baked into the design of lists that they *must* keep each string as a separate object.

Of course there are only 128 unique ASCII characters, and interning reduces some of that overhead. But even in the best case, where you are appending large strings, there's always going to be more memory overhead in a list, overhead that a buffer has the opportunity to avoid.

And if some specific implementation happens to have a particularly inefficient StringIO, that's a matter of quality of implementation, and something for the users of that specific interpreter to take up with its maintainers. It's not a reason for us to reject Paul's proposal.
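If anyone wants to put rough numbers on that, sys.getsizeof gives a back-of-the-envelope view. The figures below are what I'd expect from a 64-bit CPython 3.8; they're approximate, vary by version, and are only meant to illustrate the scale of the overhead:

    import sys

    # Per-object overhead dwarfs tiny strings.
    print(sys.getsizeof(''))    # ~49 bytes of overhead for an empty str
    print(sys.getsizeof('a'))   # ~50 bytes for a single ASCII char

    # One emoji is enough to push the whole string into UCS4 storage.
    print(sys.getsizeof('a' * 1000))                # ~1049: about 1 byte per char (UCS1)
    print(sys.getsizeof('\U0001F600' + 'a' * 999))  # ~4076: about 4 bytes per char (UCS4)

    # A list of one-char strings pays ~8 bytes per pointer for the list
    # alone, before counting the string objects it points to (interning
    # means the repeated 'a' entries at least share one object here).
    chars = list('a' * 1000)
    print(sys.getsizeof(chars))   # ~8056: just the pointer array

The exact numbers don't matter much; the point is the per-object overhead that a contiguous buffer never has to pay.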
> And thinking about portable code makes it even worse. Your code might be run under CPython and take even more memory, or it might be run under a different Python implementation where StringIO is not accelerated (where it’s just a TextIOWrapper around a BytesIO) and therefore be a whole lot slower instead.
So wait, let me see if I understand your argument:

1. CPython's string concatenation is absolutely fine, even though it is demonstrably slower on 11 out of the 12 interpreters that Paul tested.

2. The mere possibility of even a single hypothetical Python interpreter that has a slow and unoptimized StringIO buffer is enough to count against Paul's proposal.

Is that correct, or have I missed some nuance to your defence of string concatenation and rejection of Paul's proposal?
> So it has to be able to deal with both of those possibilities, not just one; code that uses the usual idiom, on the other hand, behaves pretty similarly on all implementations.
The "usual idiom" being discussed here is repeated string concatenation, which certainly does not behave similarly on all implementations. Unless, of course, you're referring to it performing *really poorly* on all implementations except CPython.
> > My whole concern is along 2 lines:
> >
> > 1. This StringBuilder class *could* be an existing io.StringIO.
> > 2. By just adding __iadd__ operator to it.
>
> No, it really couldn’t. The semantics are wrong (unless you want, say, universal newline handling in your string builder?),
Ah, now *that* is a good point.
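For anyone following along at home, this is roughly what Andrew means, assuming I've read the io docs correctly. A StringIO constructed with newline=None quietly translates line endings, which is not something you'd want from a general-purpose string builder:

    import io

    # Default newline='\n': no translation, you get back what you wrote.
    plain = io.StringIO()
    plain.write('spam\r\neggs')
    print(repr(plain.getvalue()))       # 'spam\r\neggs'

    # newline=None: universal-newline translation is applied, so
    # '\r\n' (and bare '\r') silently become '\n'.
    universal = io.StringIO(newline=None)
    universal.write('spam\r\neggs')
    print(repr(universal.getvalue()))   # 'spam\neggs'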
> it’s optimized for a different use case than string building,
It is? That's odd. The whole purpose of StringIO is to build strings. What use-case do you believe it is optimized for?

--
Steven