
Hello,

On Mon, 30 Mar 2020 13:59:42 -0700 Andrew Barnert <abarnert@yahoo.com> wrote:
On Mar 30, 2020, at 13:06, Paul Sokolovsky <pmiscml@gmail.com> wrote:
I appreciate expressing it all concisely and clearly. Then let me respond here instead of the very first '"".join() rules!' reply I got.
Ignoring replies doesn’t actually answer them.
I'm happy to discuss various points, but it would be nice to keep the discussion focused, given that the proposed change is pretty simple. I'm not sure if it is my fault for having tried to structure the original RFC as a poor-man's PEP (so it's somewhat longish), but I definitely would like to avoid discussing extended topics along the lines of "there are some mundane languages which offer those string builder classes, but Python is so, SO, special, that it doesn't need it, and whoever thinks otherwise just doesn't get it" or "building a string from pieces by putting pointers to pieces into an array, and then concatenating them together is the PEAK achievement of computer science, and whoever didn't get that just... just... didn't read the CPython (yes, CPython!) FAQ".
The issue with "".join() is very obvious:
------
import io
import sys

def strio():
    sb = io.StringIO()
    for i in range(50000):
        sb.write(u"==%d==" % i)
    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))
This doesn’t tell you anything useful. As the help for getsizeof makes clear, “Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to”. So this gives you some fixed value like 152, no matter how big the buffer and other internal objects may be.
Yeah, I tried to account for that with "sys.getsizeof(sb) + sys.getsizeof(sb.getvalue())", thanks for noticing that.
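To illustrate the getsizeof point on a container whose behavior is not in dispute, here is a minimal sketch using a plain list of strings. The names `pieces`, `shallow` and `deep` are made up for this example; it says nothing about what io.StringIO reports on any particular implementation.

```python
import sys

# Hypothetical data mirroring the loop in the quoted snippet.
pieces = ["==%d==" % i for i in range(50000)]

# getsizeof() on the list counts the list object and its pointer array only.
shallow = sys.getsizeof(pieces)

# Adding the size of every referenced str object gives a fuller picture.
deep = shallow + sum(sys.getsizeof(p) for p in pieces)

print(shallow, deep)
```

The gap between the two numbers is the part that a single getsizeof() call on the container does not report.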
If you’re using CPython with the C accelerator, none of those things are available to you from the API, but a quick scan of the C source shows what’s there, and it’s generally actually more storage than the list version. Oversimplifying a bit: While you’re building, it keeps a _PyAccu structure, which is basically a wrapper around that same list of strings. When you call getvalue() it then builds a Py_UCS4* representation that’s in this case 4x the size of the final string (since your string is pure ASCII and will be stored in UCS1, not UCS4). And then there’s the final string.
Thanks very much for this intro to the CPython io.StringIO implementation, much appreciated. Please let me return the favor and explain how StringIO is implemented in Pycopy, which I happen to maintain, and in MicroPython (where the original implementation was written by me). So, there's an array of bytes. Both implementations use utf-8 to store strings. So, StringIO stores exactly as many bytes as there is actual data in the (utf-8) strings. Of course, there's some over-allocation policy to avoid severe quadratic behavior on growing. Overall, storing N bytes of string data requires N bytes plus a small percentage of N. No additional array of pointers is needed. The original constituent strings (each over-allocated of course) can be GCed in the meantime. The moral is known, and was stated in the original RFC: for as long as somebody's attention is fixated on CPython, the likely reply from them will be: "there's no problem with CPython3, so there's nothing to fix". It takes stepping up and thinking about *multiple* implementations and the *interface* they *can* offer.
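The storage scheme described above can be sketched in pure Python. This is only an illustration under stated assumptions, not the Pycopy or MicroPython code: the class name `Utf8Builder` is hypothetical, and the growth policy is whatever the host implementation's bytearray happens to use.

```python
class Utf8Builder:
    """Hypothetical builder that keeps all data in one growable UTF-8 buffer."""

    def __init__(self):
        # One contiguous byte buffer; no per-piece array of pointers.
        self._buf = bytearray()

    def write(self, s):
        # Append the encoded bytes; the piece itself can be collected afterwards.
        self._buf += s.encode("utf-8")
        return len(s)

    def getvalue(self):
        # Decode the accumulated bytes back into a str.
        return self._buf.decode("utf-8")


sb = Utf8Builder()
for i in range(3):
    sb.write("==%d==" % i)
result = sb.getvalue()
print(result)
```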
So, if this memory issue makes join unacceptable, it makes your optimization even more unacceptable.
And thinking about portable code makes it even worse. Your code might be run under CPython and take even more memory, or it might be run under a different Python implementation where StringIO is not accelerated (where it’s just a TextIOWrapper around a BytesIO) and therefore be a whole lot slower instead. So it has to be able to deal with both of those possibilities, not just one; code that uses the usual idiom, on the other hand, behaves pretty similarly on all implementations.
Indeed, it absolutely and guaranteedly wastes a lot of memory. (It's also the fastest, no worries.)
There's absolutely no reason why performing the trivial operation of accumulating string content should take about an order of magnitude more memory than is actually needed for that string content. Don't get me wrong - if you want to spend that much of your memory, then sure, you can. But jumping in with that as *the only right solution* whenever somebody mentions "string concatenation" is a bit ... umm, cavalier.
And making a wild guess about how things might be implemented and offering an optimization based on that guess that actually makes things worse and refusing to even reply when people point out the problems isn’t even more cavalier?
The point I tried to show is that StringIO is never worse than str += regarding performance (stats for 8 implementations were demonstrated). What was left implied is that it can also be very memory-efficient, but thanks to your thorough attention, that has now been made explicit, with a (very simple and obvious!) implementation achieving that described. I'm sorry to hear about deficiencies in the StringIO implementation of your favorite Python implementation. On the positive side, now that they're identified, they can be fixed (if there's a need to care about them for that particular implementation). Likewise, I'm sorry for not showing the full possible extent of appreciation for your joining the discussion of the "StringIO vs str +=" matters with claims like "str.join is the fastest!!", and for instead repeatedly calling to stay on the topic of improving the interface for string building to be on the same level as the simple and obvious "str +=". I still tried to answer why str.join can't be a universal solution for all cases; I'm sorry if I failed to do that.
My whole concern is along 2 lines:
1. This StringBuilder class *could* be an existing io.StringIO.
2. By just adding __iadd__ operator to it.
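The two points above can be tried out today with a small subclass. This is a sketch of the idea only; the subclass name `StringBuilder` is an assumption made for the example, and the actual proposal is to add the operator to io.StringIO itself.

```python
import io


class StringBuilder(io.StringIO):
    """Hypothetical subclass: io.StringIO plus an in-place add operator."""

    def __iadd__(self, s):
        # Delegate to the existing write() method and keep the same object.
        self.write(s)
        return self


sb = StringBuilder()
for i in range(3):
    sb += "==%d==" % i  # reads like "str +=", but appends to the buffer
built = sb.getvalue()
print(built)
```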
No, it really couldn’t. The semantics are wrong (unless you want, say, universal newline handling in your string builder?), it’s optimized for a different use case than string building, and both the pure-Python and CPython accelerator implementations are less efficient in speed and/or memory.
Less efficient than what? I start with "str +=", which is simple and obvious but visibly inefficient across different Python implementations. I proceed by proposing how, with a very simple change, the simplicity and obviousness of "str +=" can be retained while runtime efficiency is dramatically improved (without any special implied memory use deficiencies). You keep pushing that "there's a faster way to do it". Yes, you're right - there is. But my proposal was never about the "fastest string concat in the west", or it would have been about rewriting some code in assembler.
That's it, nothing else. What's inside StringIO class is up to you (dear various Python implementations, their maintainers, and contributors).
Sure, but what’s inside has to actually perform the job it was designed to do and is documented to do: to simulate a file object in memory. Which is not the same thing as being a string builder.
Once somebody tries to implement a dedicated "string builder", they will find that it's some 80% similar to "simulate a file object in memory". On average. I'm sorry to hear about outlier implementations where (per your words) the similarity is less than that.
For example, fans of "".join() surely can have it inside. Actually, it's a known fact that Python2's "StringIO" module (the original home of StringIO class) was implemented exactly like that, so you can go straight back to the future.
Python2’s StringIO module is for bytes, not Unicode strings.
It just occurred to me: maybe I chose the wrong class for running discussion, maybe that should have been BytesIO, and you'd be half won over by now? ;-)
If you want a mutable bytes-like type, bytearray already exists; there’s no need to wrap the sequence up in a file-like API just to rewrap that in a sequence-like API again;
I humbly disagree. And the motivation is exactly parallel to that of str vs io.StringIO. For a (binary) string builder, you constantly need to grow its internal buffer. You also need to do the same for "simulating a file in memory". Then once you have an object which does that (hopefully efficiently, again "ah" to those which don't), you don't need to complicate the implementation of other objects to optimize for the "growing" case. Just use an object suitable for a particular use case: bytearray for in-place updates, and BytesIO for growing construction. I'm sorry in advance if the FAQ for your Python implementation doesn't provide such suggestions. FAQs for other Python implementations very well may.
just use the sequence directly. What StringIO is there for is when you _need_ the file API, just as in Python 3’s io.BytesIO. It’s not a more efficient bytearray or one better suited for string building; it’s less efficient and less well suited for string building but it adds different features.
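For reference, both objects discussed in this exchange can accumulate bytes piece by piece. The sketch below only shows that the two spellings yield the same result; it makes no claim about which is faster or leaner on any implementation. The variable names are made up for the example.

```python
import io

# Hypothetical input pieces (bytes %-formatting needs Python 3.5+).
chunks = [b"==%d==" % i for i in range(3)]

# Sequence API: grow a bytearray in place.
buf = bytearray()
for c in chunks:
    buf += c

# File API: write the same pieces into a BytesIO.
bio = io.BytesIO()
for c in chunks:
    bio.write(c)

same = bytes(buf) == bio.getvalue()
print(same)
```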
And again, the need for anything like that might be unclear for CPython-only users. Such users can write a StringBuilder class like above, or repeat the beautiful "".join() trick over and over again. The need for a nice string builder class may arise only from considering that Python-as-a-language lacks a clear and nice abstraction for it, and from thinking about how to add such an abstraction in a performant way (the criteria for which differ) in as many implementations as possible, in as easy a way as possible. (At least that's my path to it; I'm not sure if a different thought process might lead to it too.)
The problem isn’t your start, it’s jumping to the assumption that StringIO must be an answer, and then not checking the docs and the
Wrong claim. I just suggest that it *can* be an answer.
code to see if there are problems, and then ignoring the problems when they’re pointed out. Why do you think a virtual file object must be the optimal way to implement a string builder in the first place?
Wrong claim: I don't say "optimal" (after all, you suggested that there's a faster way, and in some cases that can be "optimal"). I would say a "good compromise".

--
Best regards,
 Paul                          mailto:pmiscml@gmail.com