Re: [pypy-dev] [Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

On Mon, Mar 30, 2020 at 10:24:02AM -0700, Andrew Barnert via Python-ideas wrote:
On Mar 30, 2020, at 10:01, Brett Cannon <brett@python.org> wrote:
[talking about string concatenation]
When you're talking about N that small (2 or 4, say), it is quite possible that the overhead of constructing a list, then looking up and calling a method, is greater than that of string concatenation, even without the optimization. I wouldn't want to bet either way without benchmarks, and I wouldn't trust benchmarks from one machine to apply to another.
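For anyone who wants to try it on their own machine, here is a minimal timeit sketch of the small-N case (the variable names and repeat count are illustrative only; the relative cost depends on the machine and the interpreter, which is rather the point):

    import timeit

    # Four short, illustrative pieces: the "small N" case.
    a, b, c, d = "alpha", "beta", "gamma", "delta"

    # Repeated concatenation versus building a list and joining.
    concat = timeit.timeit(lambda: a + b + c + d, number=1_000_000)
    joined = timeit.timeit(lambda: "".join([a, b, c, d]), number=1_000_000)

    print(f"concat: {concat:.3f}s   join: {joined:.3f}s")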
Ah, but that's the rub. How often do they know they need to do that "quick search"? Unless they get bitten by poor performance, and spend the time to profile their script and discover the cause of the slowdown, how would they know what the cause was?

If people already know about the string concatenation trap, they don't need a quick search, and they're probably not writing repeated concatenation for arbitrary N in the first place. Although I have come across a few people who are completely dismissive of the idea of using cross-platform best practices, even actively hostile to the idea that they should avoid idioms that will perform badly on other interpreters.

On the third hand, if they don't know about the trap, then it won't be a quick search, because they don't know what to search for (unless it's "why is Python so slow?", which won't be helpful).

Disclaimer: intellectually, I like the CPython string concatenation optimization. It's clever, a Neat Hack, and I really admire it. But I can't help feeling that, *just maybe*, it's a misplaced optimization, and if it were proposed today, when we are more concerned about alternative interpreters, we might not have accepted it.

Perhaps if CPython didn't dominate the ecosystem so completely, and more people wrote cross-platform code that was run across multiple interpreters, we wouldn't be quite so keen on an optimization that encourages quadratic behaviour half the time. So even though I don't *quite* agree with Paul, I can see that, from the perspective of people using alternate interpreters, this CPython optimization could easily be characterized as a mis-optimization: "Why is CPython encouraging people to use an idiom that is all but guaranteed to be hideously slow on everyone else's interpreter?"

Since Brett brought up the notion of fairness, one might even be forgiven for considering that such an optimization in the reference interpreter, knowing that most of the other interpreters cannot match it, is an unfair, aggressive, anti-competitive action. Personally, I wouldn't go quite so far. But I can see why people who are passionate about alternate interpreters might feel that this optimization is both harmful and unfair on the greater Python ecosystem.

Apart from cross-platform issues, another risk with the concat optimization is that it's quite fragile and sensitive to the exact form of your code. A small, seemingly insignificant change can have enormous consequences:

    In [1]: strings = ['abc']*500000

    In [2]: %%timeit
       ...: s = ''
       ...: for x in strings:
       ...:     s = s+x
       ...:
    36.4 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [3]: %%timeit
       ...: s = ''
       ...: for x in strings:
       ...:     s = t = s+x
       ...:
    59.7 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's more than a thousand times slower. (As I understand it, the optimization can only resize the string in place while nothing else holds a reference to the intermediate result; binding it to `t` as well keeps a second reference alive, so every iteration copies the entire string, and the loop degenerates to quadratic time.)

And I think people often underestimate how painful it can be to debug performance problems caused by this. If you haven't been burned by it before, it may not be obvious just how risky repeated concatenation can be. Here is an example from real life.
In 2009, about four years after the in-place string concatenation optimization was added to CPython, Chris Withers asked for help debugging a problem where Python httplib was literally hundreds of times slower than other tools, like wget and Internet Explorer:

https://mail.python.org/pipermail/python-dev/2009-August/091125.html

A few weeks later, Simon Cross realised the problem was probably the quadratic behaviour of repeated string addition:

https://mail.python.org/pipermail/python-dev/2009-September/091582.html

leading to this quote from Antoine Pitrou:

"Given differences between platforms in realloc() performance, it might be the reason why it goes unnoticed under Linux but degenerates under Windows."

https://mail.python.org/pipermail/python-dev/2009-September/091583.html

and Guido's comment:

"Also agreed that this is an embarrassment."

https://mail.python.org/pipermail/python-dev/2009-September/091592.html

So even in CPython, it isn't inconceivable that the concat optimization may fail and you will have hideously slow code.

At this point, I think that CPython is stuck with this optimization, for good or ill. Removing it will just hurt a bunch of code that currently performs adequately. But I can't help but feel that, knowing what we know now, there's a good chance that if that optimization were proposed now rather than in 2.4, we might not accept it.
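For code that needs to stay linear on every interpreter, the usual idioms are to accumulate the pieces in a list and join once at the end, or to write to an io.StringIO buffer (the "string buffer object" of this thread's subject line). A rough sketch of both, using nothing beyond the standard library:

    import io

    strings = ['abc'] * 500000

    # Idiom 1: collect the pieces, join exactly once at the end.
    parts = []
    for x in strings:
        parts.append(x)
    s1 = ''.join(parts)

    # Idiom 2: io.StringIO as an append-only string buffer.
    buf = io.StringIO()
    for x in strings:
        buf.write(x)
    s2 = buf.getvalue()

    assert s1 == s2

Either form runs in time linear in the total length whether or not the interpreter implements the concat optimization, so it cannot be bitten by the trap described above.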
Does the CPython standard library count? See above.

-- 
Steven