Re: Explicitly defining a string buffer object (aka StringIO += operator)

On Tue, Mar 31, 2020 at 12:39:56AM -0700, Christopher Barker wrote:
Indeed it would, and if I ask for a nutcracker to crack open peanuts, a nuclear-powered 200-tonne bulldozer would satisfy the requirements too.

What Paul asked for: https://media.qcsupply.com/media/catalog/product/cache/122b61bfb663175d7f1bb...

What we're talking about instead: http://media.firebox.com/pic/p1861_column_grid_12.jpg

If you want a mutable string type, you have to support a rather extensive string API that includes at least ten operators: + * % == != in < <= >= > plus slicing and about 45 methods. (There may be some things I have missed.) Paul asked for *one* operator, `+=`.
Yes, rather like the way you have to call `''.join(buf)` if you use a list. If your point is a criticism of the StringIO proposal, it is equally a criticism of the recommended list+join idiom; but we still tell people to use list+join, so it can't be a very important issue. And if it's not a valid criticism of the list+join idiom, or at least only a very minor, unimportant one, then precisely the same applies to StringIO.
So how exactly does this meet the use case of being able to drop it into code that’s written to use strings?
This has been covered at least twice, once by Paul and once by me. When you are refactoring the "string concatenation" idiom to list append, you have to change three things:

1. the buffer initialisation: `buf = ''` --> `buf = []`;
2. add a conversion at the end: `result = ''.join(buf)`;
3. and potentially dozens of instances of `buf += s` to `buf.append(s)`.

The third may be scattered around multiple functions. It is not always an easy refactoring. We should not assume that every append to the buffer will be in a single place or even a single function.

Under Paul's suggestion, you change the buffer initialisation, add the conversion, and nothing else needs to be touched. You also get a performance boost (by my testing, using StringIO is about halfway in speed between string concat and list+join), and potentially a memory saving relative to the list version. (Although Paul's calculations on that have been disputed, and I don't think we have a definite answer on that one just yet.)
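A minimal sketch of the refactorings under discussion (the function and variable names are mine, not from the thread; the third version uses today's `write()` spelling, since the proposed `+=` support doesn't exist):

```python
from io import StringIO

# Before: the plain string-concatenation idiom.
def render_concat(items):
    buf = ''
    for item in items:
        buf += f'{item}\n'       # quadratic in the worst case
    return buf

# list+join refactoring: every append site must change (step 3).
def render_list(items):
    buf = []                     # step 1: initialisation changes
    for item in items:
        buf.append(f'{item}\n')  # step 3: each `buf += s` becomes `buf.append(s)`
    return ''.join(buf)          # step 2: conversion added at the end

# The StringIO version: under the proposal, only the two ends would
# change and `buf += s` could stay; today each site needs .write().
def render_builder(items):
    buf = StringIO()
    for item in items:
        buf.write(f'{item}\n')
    return buf.getvalue()
```

All three produce identical output; the difference is only in how many call sites the refactoring touches.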
BTW, what about folks that concatenate strings with plain + ? You couldn’t drop StringIO in there reasonably either.
Do you mean people who write `buf = buf + substring`? Maybe that's an argument to support `+` as well. Perhaps it would be weird and surprising to support augmented assignment for an operator without supporting the operator itself? But I don't have a strong feeling either way. Or maybe that's just an argument that no solution is going to solve *every* problem. What do we do about people who write this inside a loop:

    buf = f'{buf}{substring}'

We can't fix everyone's code with one change.
It's not really. The point is to minimise how many things need to change. Instead of changing every `buf += s` into `buf.append(s)`, you just leave them alone. In that regard, it's *the same way* of doing it; the only difference is that the buffer changes from an immutable string to a builder. We should be careful to avoid exaggerating tiny differences, especially when that involves a double standard. Is it a problem for the recommended list+join idiom that it is "a whole other way of doing it"? If no, then it's not a problem for StringIO either. If the answer is yes, then StringIO is no worse than what people are already told to do, so it's not really much of a problem.
And the list and join() method uses two of the most common builtins— there is a real advantage to that.
Again, we must beware of double standards. This list often says: "Not everything needs to be a builtin, just put it in a module." Also this list: "Being a builtin is a real advantage, we can't use StringIO because it comes from a module that needs importing." We can't have it both ways. In this case, using StringIO does involve one extra import. Okay, that's fine; the standard library is full of things that require one extra import but nevertheless are a common, or even sometimes recommended, way to do it. itertools comes to mind especially: using iterators in Python is ubiquitous, and yet we still require people to import a module to do something as fundamental as slice an iterator. But maybe we can consider an alternative that doesn't require that import.
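The iterator-slicing point above refers to `itertools.islice`; a brief illustration of the "one extra import" that nobody objects to:

```python
from itertools import islice

# Generators don't support subscripting, so `gen[:3]` raises TypeError;
# slicing one requires importing itertools, and that is considered fine.
gen = (n * n for n in range(10))
first_three = list(islice(gen, 3))
print(first_three)  # [0, 1, 4]
```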
Anyway, I’m not proposing a mutable string type. It was really just an example of: if this was really needed, it would have been done.
Ah, so by that argument, *every* new proposal can be immediately rejected. If this were really needed, it would have already been done. Python 3.8 is the optimal language with literally every single desirable feature, nothing to be added, and no unneeded cruft to be removed. I trust that's not what you intended to say :-) -- Steven

On Thu, Apr 2, 2020 at 3:27 PM Steven D'Aprano <steve@pearwood.info> wrote:
I don't know whether your point was that this is bad code and can't be optimized, or that it's good code but still can't be optimized by this proposal. But if the former, then I put it to you that this isn't actually bad code.

    text = ""
    for thing in stuff:
        # Option 1:
        text += f"{thing.id}: {thing.name} ({thing.cat})\n"
        # Option 2:
        text = f"{text}{thing.id}: {thing.name} ({thing.cat})\n"

Which is going to be (a) faster, and (b) more memory-efficient? What if you change interpreters? Does it make a difference whether the amount added per iteration is large or small? What if most of the content is ASCII but there's one single non-ASCII character somewhere? It's entirely viable for both forms to exist in the wild, and to be justifiable. If your argument was that this code is perfectly fine and there's just too many ways to write good code and we can't hope to optimize them all, then I apologize, and this post is irrelevant :) ChrisA

On Thu, Apr 02, 2020 at 03:37:34PM +1100, Chris Angelico wrote:
Neither. It's that anyone building a string like that isn't the target of this proposal. I could have given various examples, I just happened to pick an f-string:

    buf = '%s%s' % (buf, substring)
    buf = '{}{}'.format(buf, substring)
    buf = ''.join(itertools.chain(buf, substring))

Put any of them into a loop, and they are likely to be exceedingly slow for large N. But fixing them is not part of Paul's proposal.
But if the former, then I put it to you that this isn't actually bad code.
Repeated string concatenation doesn't suddenly become efficient just because you wave a magic f-string at it. That appends substring to the buffer each time through the loop, giving quadratic performance. On my computer, appending a single character 'a' to the buffer each time, I get:

    10_000 loops:   0.4 seconds (actual time)
    50_000 loops:   9.8 seconds  # expect 5*0.4 = 2 seconds
    100_000 loops:   39 seconds  # expect 10*0.4 = 4 seconds
    200_000 loops:  160 seconds  # expect 20*0.4 = 8 seconds

The expected times assume that the time is proportional to the number of elements, i.e. O(N). If we instead assume O(N**2), the expected times would be 10, 40 and 160 seconds, so this is a textbook example of quadratic slowdown. (At least on my computer.) -- Steven
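A minimal harness along these lines reproduces the effect (this is my sketch, not Steven's actual script, and the loop counts are reduced so it finishes quickly; exact timings will vary by machine and interpreter):

```python
import time

def time_fstring_concat(n):
    """Time building an n-character string via f-string rebinding."""
    start = time.perf_counter()
    buf = ''
    for _ in range(n):
        # Rebinding via an f-string builds a brand-new string each
        # iteration, copying the whole buffer, so the total work
        # grows roughly as O(n**2). CPython's in-place `+=` refcount
        # optimisation does not apply to this form.
        buf = f'{buf}a'
    return time.perf_counter() - start

for n in (10_000, 20_000, 40_000):
    print(f'{n:>6} loops: {time_fstring_concat(n):.3f} seconds')
```

Doubling n should roughly quadruple the reported time if the behaviour is quadratic.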

On Thu, Apr 2, 2020 at 5:58 PM Steven D'Aprano <steve@pearwood.info> wrote:
And they'd all be equivalent to the consideration I'm talking about (it's not f-string specific).
This is true; however, they're all optimizable in different ways. CPython happens to have an optimization for the += case, but it's equally possible for some other form to have an optimization.
Yep. Because CPython doesn't optimize any of them. So it's very interpreter-specific to take advantage of +=, and would be identically interpreter-specific to take advantage of any other form. This proposal continues to favour the += spelling by making it higher performing. ChrisA