[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

March 31, 2020

      Hello,

On Mon, 30 Mar 2020 13:59:42 -0700
Andrew Barnert <abarnert@yahoo.com> wrote:
...
On Mar 30, 2020, at 13:06, Paul Sokolovsky <pmiscml@gmail.com> wrote:
...
I appreciate expressing it all concisely and clearly. Then let me
respond here instead of the very first '"".join() rules!' reply I
got.
Ignoring replies doesn’t actually answer them.
I'm happy to discuss various points, but it would be nice to have
discussion focused, given that the change proposed is pretty simple.
I'm not sure if it is my fault for having tried to structure the original
RFC as a poor-man's PEP (so it's somewhat long'ish), but I definitely
would like to avoid discussing extended topics along the lines of
"there're some mundane languages which offer those string builder
classes, but Python is so, SO, special, that it doesn't need it, and
whoever thinks otherwise just doesn't get it" or "building a string
from pieces by putting pointers to pieces into array, and then
concatenating them together is the PEAK achievement of the computer
science, and whoever didn't get that just... just... didn't read
CPython (yes, CPython!) FAQ".
...
...
The issue with "".join() is very obvious:
------
import io
import sys
def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"==%d==" % i)
    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))
This doesn’t tell you anything useful. As the help for getsizeof
makes clear, “Only the memory consumption directly attributed to the
object is accounted for, not the memory consumption of objects it
refers to”. So this gives you some fixed value like 152, no matter
how big the buffer and other internal objects may be.
Yeah, I tried to account for that with "sys.getsizeof(sb) +
sys.getsizeof(sb.getvalue())", thanks for noticing that.
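
The measurement question above can be explored with a minimal sketch like the following. It assumes CPython, where the standard tracemalloc module is available. The helper names (build_with_stringio, measure_peak) are made up for illustration and are not from the thread. sys.getsizeof() reports only what the object's type chooses to account for, so the sketch also records the peak traced allocation, which is a different (and also imperfect) number.

```python
import io
import sys
import tracemalloc


def build_with_stringio(n):
    """Accumulate n small pieces in an io.StringIO and return the buffer object."""
    sb = io.StringIO()
    for i in range(n):
        sb.write("==%d==" % i)
    return sb


def measure_peak(func, *args):
    """Return (result, peak traced bytes) for one call of func, using tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


small = build_with_stringio(10)
large = build_with_stringio(50000)

# Shallow sizes of the two buffer objects; whether these differ depends on
# what the implementation's StringIO type reports for its own size.
print(sys.getsizeof(small), sys.getsizeof(large))

# The sum used in the quoted snippet: the buffer object's shallow size plus
# the size of the str returned by getvalue().
print(sys.getsizeof(large) + sys.getsizeof(large.getvalue()))

# Peak memory traced while building, as a cross-check.
_, peak = measure_peak(build_with_stringio, 50000)
print(peak)
```

None of these numbers is "the" memory cost; they only show that the figures depend on what is being counted.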
...
If you’re using CPython with the C accelerator, none of those things
are available to you from the API, but a quick scan of the C source
shows what’s there, and it’s generally actually more storage than the
list version. Oversimplifying a bit: While you’re building, it keeps
a _PyAccu structure, which is basically a wrapper around that same
list of strings. When you call getvalue() it then builds a Py_UCS4*
representation that’s in this case 4x the size of the final string
(since your string is pure ASCII and will be stored in UCS1, not
UCS4). And then there’s the final string.
Thanks very much for this intro into the CPython
io.StringIO implementation, much appreciated. Please let me return the
favor and explain how StringIO is implemented in Pycopy, which I happen to
maintain, and in MicroPython (as the original implementation there was
written by me). So, there's an array of bytes. Both implementations use
utf-8 to store strings. So, StringIO stores exactly as many bytes as there
are actual data in the (utf-8) strings. Of course, there's some
over-allocation policy to avoid severe quadratic behavior on growing.
Overall, storing N bytes of string data requires N + a small % of N bytes
of memory. No additional array of pointers is needed. The original
constituent strings (each over-allocated, of course) can be GCed in the
meantime.
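
The layout described above can be sketched in plain Python. This is only an illustration, not the Pycopy or MicroPython source: the class name Utf8Builder, the initial capacity, and the 25% growth factor are all assumptions chosen for the example.

```python
class Utf8Builder:
    """Hypothetical string builder backed by one growable array of UTF-8 bytes."""

    def __init__(self, initial_capacity=16):
        self._buf = bytearray(initial_capacity)  # allocated storage
        self.used = 0                            # bytes actually holding data

    @property
    def capacity(self):
        return len(self._buf)

    def write(self, s):
        data = s.encode("utf-8")
        needed = self.used + len(data)
        if needed > len(self._buf):
            # Assumed policy: over-allocate by 25% of the required size so
            # that repeated appends do not reallocate on every call.
            new_capacity = needed + (needed >> 2)
            self._buf.extend(bytes(new_capacity - len(self._buf)))
        # Copy the new bytes into the free tail; no per-piece pointers are kept.
        self._buf[self.used:needed] = data
        self.used = needed
        return len(s)

    def getvalue(self):
        return self._buf[:self.used].decode("utf-8")


builder = Utf8Builder()
for i in range(1000):
    builder.write("==%d==" % i)
print(builder.used, builder.capacity)
```

With this policy the allocated storage stays within roughly a quarter above the stored data, and the pieces passed to write() are not referenced afterwards.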

The moral is known, and was stated in the original RFC: for as
long as somebody's attention is fixated on CPython, the likely
reply from them would be: "there's no problem with CPython3, so there's
nothing to fix". It takes stepping up and thinking about *multiple*
implementations and the *interface* they *can* offer.
...
So, if this memory issue makes join unacceptable, it makes your
optimization even more unacceptable.
And thinking about portable code makes it even worse. Your code might
be run under CPython and take even more memory, or it might be run
under a different Python implementation where StringIO is not
accelerated (where it’s just a TextIOWrapper around a BytesIO) and
therefore be a whole lot slower instead. So it has to be able to deal
with both of those possibilities, not just one; code that uses the
usual idiom, on the other hand, behaves pretty similarly on all
implementations.
Indeed, it absolutely and guaranteedly wastes a lot of memory. (It's
also the fastest, no worries.)
...
...
There's absolutely no reason why performing the trivial operation of
accumulating string content should take about an order of magnitude
more memory than is actually needed for that string content. Don't get
me wrong - if you want to spend that much of your memory, then sure,
you can. But jumping in with that as *the only right solution* whenever
somebody mentions "string concatenation" is a bit ... umm,
cavalier
And making a wild guess about how things might be implemented and
offering an optimization based on that guess that actually makes
things worse and refusing to even reply when people point out the
problems isn’t even more cavalier?
The point I tried to show is that StringIO is never worse than str +=
regarding performance (stats for 8 implementations were demonstrated).
What was left implied is that it can also be very memory-efficient, and
thanks to your thorough attention, that has now been made explicit, with a
(very simple and obvious!) implementation that achieves it described.

I'm sorry to hear about deficiencies in the StringIO implementation of
your favorite Python implementation. On the positive side, now that
they're identified, they can be fixed (if there's a need to care
about them for that particular implementation).

Likewise, I'm sorry for not showing the full possible extent of
appreciation for your joining the discussion of the "StringIO vs str +="
matter with claims like "str.join is the fastest!!", and for instead
repeatedly calling to stay on the topic of improving the interface for
string building so that it is on the same level as the simple and obvious
"str +=". I still tried to answer why str.join can't be a universal
solution for all cases; I'm sorry if I failed to do that.
...
...
My whole concern is along 2 lines:
1. This StringBuilder class *could* be an existing io.StringIO.
2. By just adding __iadd__ operator to it.
No, it really couldn’t. The semantics are wrong (unless you want,
say, universal newline handling in your string builder?), it’s
optimized for a different use case than string building, and both the
pure-Python and CPython accelerator implementations are less
efficient in speed and/or memory.
Less efficient than what? I start with the simple and obvious
"str +=", which is visibly inefficient across different Python
implementations. I proceed by proposing how, with a very simple
change, the simplicity and obviousness of "str +=" can be retained, while
runtime efficiency can be dramatically improved (without any special
implied memory use deficiencies).
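
The change under discussion can be approximated today with a subclass. The sketch below is only an illustration of the idea: the class name AppendableStringIO is made up, and the proposal itself is about adding __iadd__ to io.StringIO directly rather than subclassing it.

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical subclass giving io.StringIO a += operator."""

    def __iadd__(self, s):
        # Delegate to write(), then return self so the name stays bound
        # to the same buffer object after "buf += ...".
        self.write(s)
        return self


buf = AppendableStringIO()
for i in range(5):
    buf += "==%d==" % i

result = buf.getvalue()
print(result)  # ==0====1====2====3====4==
```

The loop body reads the same as with a plain str accumulator; only the initialisation and the final getvalue() call differ.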

You keep pushing that "there's a faster way to do it". Yes, you're
right - there is. But my proposal was never about the "fastest string concat
in the west", or it would have been about rewriting some code in
assembler.
...
...
That's it, nothing else. What's inside StringIO class is up to you
(dear various Python implementations, their maintainers, and
contributors).
Sure, but what’s inside has to actually perform the job it was
designed to do and is documented to do: to simulate a file object in
memory. Which is not the same thing as being a string builder.
Once somebody tries to implement a dedicated "string builder", they
will find that it's some 80% similar to "simulate a file object in
memory". On average. I'm sorry to hear about outlier implementations
where (per your words) the similarity is less than that.
...
...
For example, fans of "".join() surely can have it inside. Actually,
it's a known fact that Python2's "StringIO" module (the original
home of StringIO class) was implemented exactly like that, so you
can go straight back to the future.
Python2’s StringIO module is for bytes, not Unicode strings.
It just occurred to me: maybe I chose the wrong class for running this
discussion, maybe that should have been BytesIO, and you'd be half won
over by now? ;-)
...
If you
want a mutable bytes-like type, bytearray already exists; there’s no
need to wrap the sequence up in a file-like API just to rewrap that
in a sequence-like API again;
I humbly disagree. And the motivation is exactly parallel to that of
str vs io.StringIO. For a (binary) string builder, you constantly need to
grow its internal buffer. You also need to do the same for "simulating a
file in memory". Then once you have an object which does that
(hopefully efficiently, again "ah" to those which don't), you don't
need to complicate the implementation of other objects to optimize for
the "growing" case. Just use an object suitable for a particular
use case: bytearray for in-place updates, and BytesIO for
growing-construction.
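
A small sketch of the division of labour suggested above, using only the standard bytearray and io.BytesIO types. The sample data is invented for illustration.

```python
import io

# In-place update: the buffer already exists and keeps its size,
# so a bytearray slice assignment is the natural tool.
record = bytearray(b"LEN=0000;payload")
record[4:8] = b"0007"

# Growing construction: pieces arrive one by one and the total size is
# not known up front, so they are appended to a BytesIO.
out = io.BytesIO()
for chunk in (b"abc", b"def", b"g"):
    out.write(chunk)
built = out.getvalue()

print(bytes(record), built)
```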

I'm sorry in advance if FAQ for your Python implementation doesn't
provide such suggestions. FAQs for other Python implementations very
well may.
...
just use the sequence directly. What
StringIO is there for is when you _need_ the file API, just as in
Python 3’s io.BytesIO. It’s not a more efficient bytearray or one
better suited for string building; it’s less efficient and less well
suited for string building but it adds different features.
...
And again, the need for anything like that might be unclear for
CPython-only users. Such users can write a StringBuilder class like
above, or repeat the beautiful "".join() trick over and over again.
The need for a nice string builder class may occur only from the
consideration of the Python-as-a-language lacking a clear and nice
abstraction for it, and from thinking how to add such an
abstraction in a performant way (for which the criteria differ)
in as many implementations as possible, in as easy a way as possible.
(At least that's my path to it, I'm not sure if a different thought
process might lead to it too.)
The problem isn’t your start, it’s jumping to the assumption that
StringIO must be an answer, and then not checking the docs and the
Wrong claim. I just suggest that it *can* be an answer.
...
code to see if there are problems, and then ignoring the problems
when they’re pointed out. Why do you think a virtual file object must
be the optimal way to implement a string builder in the first place?
Wrong claim: I don't say "optimal" (after all, you suggested that
there's a faster way, and in some cases that can be "optimal"). I
would say a "good compromise".

-- 
Best regards,
 Paul                          mailto:pmiscml@gmail.com