[pypy-dev] [Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

Steven D'Aprano steve at pearwood.info
Mon Mar 30 22:45:55 EDT 2020


On Mon, Mar 30, 2020 at 10:24:02AM -0700, Andrew Barnert via Python-ideas wrote:
> On Mar 30, 2020, at 10:01, Brett Cannon <brett at python.org> wrote:

[talking about string concatenation]
 
> > I don't think characterizing this as a "mis-optimization" is fair. 
[...]
> Yes. A big part of the reason there’s so much use in the wild is that 
> for small cases that aren’t in the middle of a bottleneck, it’s 
> perfectly reasonable for people to add two or three strings and not 
> care about performance. (Who cares about N**2 when N<=15 and it 
> happens at most 4 times per run of your program?) 

When you're talking about N that small (2 or 4, say), it is quite 
possible that the overhead of constructing a list then looking up and 
calling a method may be greater than that of string concatenation, even 
without the optimization. I wouldn't want to bet either way without 
benchmarks, and I wouldn't trust the benchmarks from one machine to 
apply to another.
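
A minimal sketch of how one might measure this for a tiny N, using the 
standard timeit module. The function names and the three-element input 
are made up for illustration, and the numbers will vary by machine and 
interpreter, which is rather the point:

```python
import timeit

# Hypothetical tiny input (N = 3); not taken from any real program.
parts = ["abc", "def", "ghi"]

def concat(parts):
    # Repeated concatenation, the idiom under discussion.
    s = ""
    for p in parts:
        s = s + p
    return s

def build_then_join(parts):
    # Construct a list, then look up and call str.join once.
    chunks = []
    for p in parts:
        chunks.append(p)
    return "".join(chunks)

for fn in (concat, build_then_join):
    t = timeit.timeit(lambda: fn(parts), number=100_000)
    print(f"{fn.__name__}: {t:.4f} s per 100,000 calls")
```

Both functions return the same string; only the timings differ, and 
which one wins for small N is not something to assume in advance.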


> So people do it, and 
> it’s fine. When they really do need to optimize, a quick search of the 
> FAQ or StackOverflow or whatever will tell them the right way to do 
> it, and they do it, but most of the time it doesn’t matter.

Ah, but that's the rub. How often do they know they need to do that 
"quick search"? Unless they get bitten by poor performance, and spend 
the time to profile their script and discover the cause of the 
slowdown, how would they know what the cause was?

If people already know about the string concatenation trap, they don't 
need a quick search, and they're probably not writing repeated 
concatenation for arbitrary N in the first place.
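
For readers who have not met the alternatives, here is a sketch of the 
two usual ways to build a string from many pieces without relying on 
the concatenation optimization. The helper names are hypothetical; 
str.join and io.StringIO are the standard tools:

```python
import io

def build_with_join(pieces):
    # Collect the pieces in a list, then join them in a single pass.
    chunks = []
    for piece in pieces:
        chunks.append(piece)
    return "".join(chunks)

def build_with_stringio(pieces):
    # Write the pieces into an in-memory text buffer.
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()

pieces = ["abc"] * 1000
assert build_with_join(pieces) == build_with_stringio(pieces)
```

Neither version depends on the interpreter being able to resize a 
string in place.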

That said, I have come across a few people who are completely 
dismissive of the idea of using cross-platform best practices, and even 
actively hostile to the idea that they should avoid idioms that will 
perform badly on other interpreters.

On the third hand, if they don't know about the trap, then it won't be a 
quick search because they don't know what to search for (unless it's 
"why is Python so slow?" which won't be helpful).



Disclaimer: intellectually, I like the CPython string concatenation 
optimization. It's clever, a Neat Hack, I really admire it. But I can't 
help feeling that, *just maybe*, it's a misplaced optimization, and if 
it were proposed today when we are more concerned about alternative 
interpreters, we might not have accepted it.

Perhaps if CPython didn't dominate the ecosystem so completely, and more 
people wrote cross-platform code that was run across multiple 
interpreters, we wouldn't be quite so keen on an optimization that 
encourages quadratic behaviour half the time. So even though I don't 
*quite* agree with Paul, I can see that from the perspective of people 
using alternate interpreters, this CPython optimization could easily be 
characterized as a mis-optimization.

"Why is CPython encouraging people to use an idiom that is all but 
guaranteed to be hideously slow on everyone else's interpreter?"

Since Brett brought up the notion of fairness, one might even be 
forgiven for considering that such an optimization in the reference 
interpreter, knowing that most of the other interpreters cannot match 
it, is an unfair, aggressive, anti-competitive action.

Personally I wouldn't go quite so far. But I can see why people who are 
passionate about alternate interpreters might feel that this optimization 
is both harmful and unfair on the greater Python ecosystem.

Apart from cross-platform issues, another risk with the concat 
optimization is that it's quite fragile and sensitive to the exact form 
of your code. A small, seemingly insignificant change to your code can 
have enormous consequences:

In [1]: strings = ['abc']*500000

In [2]: %%timeit
   ...: s = ''
   ...: for x in strings:
   ...:     s = s+x
   ...:
36.4 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: s = ''
   ...: for x in strings:
   ...:     s = t = s+x
   ...:
59.7 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's more than a thousand times slower (roughly 1,600 times, going by 
the figures above).
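
As I understand it, the optimization can only extend the left-hand 
string in place when nothing else refers to it. Binding the result to a 
second name (t) keeps an extra reference alive, so each pass through 
the loop has to copy the whole string. A sketch for reproducing the 
comparison outside IPython, with a much smaller N so that it finishes 
quickly (the function names are made up for illustration):

```python
import time

strings = ["abc"] * 10_000  # far fewer than the 500,000 used above

def concat_plain(strings):
    s = ""
    for x in strings:
        s = s + x
    return s

def concat_aliased(strings):
    s = ""
    for x in strings:
        # t holds a second reference to the newly built string.
        s = t = s + x
    return s

for fn in (concat_plain, concat_aliased):
    start = time.perf_counter()
    result = fn(strings)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed * 1000:.2f} ms, {len(result)} chars")
```

Both functions build the same string. At this size the difference may 
be modest; the gap should widen as N grows, since one loop is linear 
and the other quadratic when the optimization does not apply.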

And I think people often underestimate how painful it can be to debug 
performance problems caused by this. If you haven't been burned by it 
before, it may not be obvious just how risky repeated concatenation can 
be. Here is an example from real life.

In 2009, about four years after the in-place string concatenation 
optimization was added to CPython, Chris Withers asked for help 
debugging a problem where Python httplib was literally hundreds of times 
slower than other tools, like wget and Internet Explorer:

https://mail.python.org/pipermail/python-dev/2009-August/091125.html

A few weeks later, Simon Cross realised the problem was probably the 
quadratic behaviour of repeated string addition:

https://mail.python.org/pipermail/python-dev/2009-September/091582.html

leading to this quote from Antoine Pitrou:

"Given differences between platforms in realloc() performance, it might 
be the reason why it goes unnoticed under Linux but degenerates under 
Windows."

https://mail.python.org/pipermail/python-dev/2009-September/091583.html

and Guido's comment:

"Also agreed that this is an embarrassment."

https://mail.python.org/pipermail/python-dev/2009-September/091592.html


So even in CPython, it isn't inconceivable that the concat optimization 
may fail and you will have hideously slow code.
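
To make the pattern concrete, here is a hypothetical read loop of the 
general kind discussed in that thread. This is NOT the actual httplib 
code, just an illustration; io.BytesIO stands in for a socket-like 
stream, and the function names and chunk size are assumptions:

```python
import io

def read_all_concat(stream, chunk_size=8192):
    # Accumulate by repeated concatenation. Whether this is fast
    # depends on the interpreter and the platform's allocator.
    data = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data += chunk
    return data

def read_all_join(stream, chunk_size=8192):
    # Accumulate chunks in a list and join them once at the end.
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)

payload = b"x" * 1_000_000
assert read_all_concat(io.BytesIO(payload)) == payload
assert read_all_join(io.BytesIO(payload)) == payload
```

The two functions return identical data. The second should not depend 
on realloc() behaviour or on any in-place concatenation trick.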

At this point, I think that CPython is stuck with this optimization, for 
good or ill. Removing it will just hurt a bunch of code that currently 
performs adequately. But I can't help but feel that, knowing what we 
know now, there's a good chance that if that optimization were proposed 
now rather than in 2.4, we might not accept it.


> Maybe the OP could argue that this was a bad decision by finding 
> examples of code that actually relies on that optimization despite 
> being intended to be portable to other implementations. It’s worth 
> comparing the case of calling sum on strings—which is potentially 
> abused more often than used harmlessly, so instead of optimizing it, 
> CPython made it an error. But without any such known examples, it’s 
> hard not to call the string concatenation optimization a win.

Does the CPython standard library count? See above.
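
For reference, the sum() behaviour mentioned in the quoted text can be 
checked directly. A small sketch, assuming CPython; the variable names 
are made up:

```python
words = ["a", "b", "c"]

raised = False
try:
    # Summing with a str start value is rejected outright.
    sum(words, "")
except TypeError as exc:
    raised = True
    print("sum() refused:", exc)

# The supported spelling.
joined = "".join(words)
print(joined)
```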



-- 
Steven

