
On Mar 30, 2020, at 22:03, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Mar 30, 2020 at 01:59:42PM -0700, Andrew Barnert via Python-ideas wrote:
[...]
When you call getvalue() it then builds a Py_UCS4* representation that’s in this case 4x the size of the final string (since your string is pure ASCII and will be stored in UCS1, not UCS4). And then there’s the final string.
So, if this memory issue makes join unacceptable, it makes your optimization even more unacceptable.
You seem to be talking about a transient spike in memory usage, as the UCS4 string is built then disposed of. Paul seems to be talking about holding on to large numbers of substrings for long periods of time, possibly minutes or hours or even days in the case of a long running process.
But StringIO has the same long-term cost of the list, _plus_ a transient spike. There’s no way that can be better than just the same long-term cost. You can try to argue that it’s not that much worse, or that it isn’t worse in some cases, or that it could be optimized to not be as much worse; I’ll snip out all of those arguments because even if you’re right, it’s still not better. So this proposal amounts to changing Python, so that we can then get everyone to stop using the idiom they’ve been using for decades and use a different one, just to get maybe at best the same performance they already have. Why does that sound reasonable to you?
Whether StringIO takes advantage of that opportunity *right now* or not is, in a sense, irrelevant. It's an opportunity that lists don't have. Any (potential) inefficiency in StringIO could be improved, but it's baked into the design of lists that it *must* keep each string as a separate object.
The reason StringIO keeps a list (well, a C struct that’s almost the same thing as a list) is because it’s fast. It’s not the simplest implementation, it’s something that people put a lot of work into optimizing. Is it possible that someone could come up with something that’s even better for the main uses of StringIO (simulating a file), and that also happens to be good for use as a string builder? Sure, I suppose it’s possible. But do you really think we should make a change just so we can encourage people to switch to using something that’s slower and takes more memory (and doesn’t work in older versions of Python) just because it’s not impossible that one day someone will come up with a new optimization that makes it better instead of worse?
And if some specific implementation happens to have a particularly inefficient StringIO, that's a matter of quality of implementation and something for the users of that specific interpreter to take up with its maintainers. It's not a reason for us to reject Paul's proposal.
But if every implementation of StringIO, in every interpreter, is actually worse than joining lists, isn’t that a reason for us to reject the proposal?
And thinking about portable code makes it even worse. Your code might be run under CPython and take even more memory, or it might be run under a different Python implementation where StringIO is not accelerated (where it’s just a TextIOWrapper around a BytesIO) and therefore be a whole lot slower instead.
So wait, let me see if I understand your argument:
1. CPython's string concatenation is absolutely fine, even though it is demonstrably slower on 11 out of the 12 interpreters that Paul tested.
No. This is no part of my argument. The recommended way to handle building large strings out of lots of little strings is, and always has been, to join a list. It’s in the FAQ. It’s even baked into the code of CPython (see the error message from calling sum on strings). People should not be concatenating strings, but we don’t need to offer them a better solution because they already have a better solution.
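For anyone following along, here is a minimal sketch of the idiom in question. The helper name build_report and the sample data are made up for illustration. The last few lines show the CPython behavior mentioned above: sum refuses to add strings together, and its error message points at join.

```python
def build_report(rows):
    """Build one large string from many small pieces by joining a list."""
    parts = []
    for name, value in rows:
        # Collect the pieces; nothing is concatenated yet.
        parts.append(f"{name}={value}\n")
    # A single join at the end produces the final string.
    return "".join(parts)


report = build_report([("a", 1), ("b", 2)])

# On CPython, sum() rejects a str start value and recommends join instead.
try:
    sum(["a", "b"], "")
except TypeError as exc:
    sum_error = str(exc)
```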
2. The mere possibility of even a single hypothetical Python interpreter that has a slow and unoptimized StringIO buffer is enough to count against Paul's proposal.
No, the fact of every real life Python interpreter having a StringIO that’s at least a little worse than string join, and in some cases a lot worse, is enough to rule out the proposal. (The facts that StringIO also has the wrong semantics, is less obvious for the purpose, and isn’t a decades-long established idiom are additional problems with the proposal. And the biggest problem is that the proposal is trying to fix a problem that doesn’t exist in the first place.)
Is that correct, or have I missed some nuance to your defence of string concatenation and rejection of Paul's proposal?
You haven’t missed any nuance, you’ve missed the entire point. I am not defending string concatenation, I’m defending the established idiom of join. I am not arguing to reject Paul’s proposal because it might theoretically be inefficient on some implementation, but because it definitely is inefficient on every existing implementation. And because it’s wrong to boot, and because it doesn’t solve any actual problem.
So it has to be able to deal with both of those possibilities, not just one; code that uses the usual idiom, on the other hand, behaves pretty similarly on all implementations.
The "usual idiom" being discussed here is repeated string concatenation,
No it isn’t. The usual idiom is join. It’s true that there are some people who never read the docs, never search StackOverflow or Python-list, never talk to other developers, etc., and abuse string concatenation. But giving them a second idiom isn’t going to change that—they’re still not going to read the docs, etc. We could give them 30 better ways to do it, and that won’t be any better than giving them 1 way.
My whole concern is along 2 lines:
1. This StringBuilder class *could* be an existing io.StringIO.
2. By just adding __iadd__ operator to it.
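For concreteness, the quoted proposal could be sketched roughly as below. The class name StringBuilder comes from the quote; doing it as a subclass, and the variable names, are assumptions made for illustration and not Paul's actual change.

```python
import io


class StringBuilder(io.StringIO):
    """Hypothetical sketch: io.StringIO plus an in-place += that appends text."""

    def __iadd__(self, text):
        # Delegate to the existing file-style write() and keep the same object.
        self.write(text)
        return self


buf = StringBuilder()
buf += "spam"
buf += " and eggs"
result = buf.getvalue()

# The established idiom produces the same string with a plain list.
joined = "".join(["spam", " and eggs"])
```

Both spellings end up with the same string; the disagreement in this thread is about cost and semantics, not about whether the operator can be written.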
No, it really couldn’t. The semantics are wrong (unless you want, say, universal newline handling in your string builder?),
Ah, now *that* is a good point.
it’s optimized for a different use case than string building,
It is? That's odd. The whole purpose of StringIO is to build strings.
What use-case do you believe it is optimized for?
Guido already answered this; but let me ask a followup question: Why would you think a class that’s in the io module, that implements the text file ABC (and doesn’t implement a string-builder API, hence Paul’s proposal), and that’s documented as a way to be “an in-memory stream for text I/O” would be optimized for use as a string builder instead of for use as an in-memory file object?
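A small sketch of what "in-memory file object" means in practice, using only the documented io.StringIO API. The variable names are illustrative. Note that the newline translation shown at the end only happens when newline=None is passed; as far as I know the default (newline="\n") leaves the text alone.

```python
import io

# StringIO behaves like a text file: it has a position, and supports
# write, seek, readline and read.
stream = io.StringIO()
stream.write("first line\nsecond line\n")
stream.seek(0)
first = stream.readline()
rest = stream.read()

# With newline=None, universal newline handling is enabled and "\r\n"
# is translated to "\n", which is file semantics, not string-builder semantics.
translating = io.StringIO(newline=None)
translating.write("a\r\nb")
translated = translating.getvalue()
```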