[Python-Dev] sum(...) limitation

Chris Barker - NOAA Federal chris.barker at noaa.gov
Tue Aug 12 03:21:15 CEST 2014


Sorry for the bike shedding here, but:

The quadratic behaviour of repeated str summation is a subtle, silent error.

OK, fair enough. I suppose it would be hard and ugly to catch those
instances and raise an exception pointing users to "".join.
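
For reference, a rough sketch of the behaviour being discussed -- sum() already special-cases str and points at the join idiom:

    parts = ["some", "strings", "to", "concatenate"]

    # sum() refuses a str start value outright:
    #     sum(parts, "")
    #     TypeError: sum() can't sum strings [use ''.join(seq) instead]

    # The linear-time spelling the error message points to:
    result = "".join(parts)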

It *is* controversial that CPython silently optimises some cases of it away,
since it can cause problems when porting affected code to other
interpreters that don't use refcounting and thus have a harder time
implementing such a trick.
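
The optimisation in question applies to the usual accumulation loop, roughly:

    pieces = ["some", "strings", "to", "concatenate"]

    # Naively O(n**2): each += builds a new string and copies everything
    # accumulated so far.  CPython can often resize the string in place when
    # its refcount is 1 -- the refcounting trick other interpreters can't
    # easily replicate.
    result = ""
    for piece in pieces:
        result += piece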

Is there anything in the language spec that says string concatenation is
O(n^2)? Or, for that matter, any of the performance characteristics of
built-in types? Those strike me as implementation details that SHOULD be
particular to the implementation.

Should we cripple the performance of some operation in CPython so that it
won't work better than Jython? That seems an odd choice. Then how dare PyPy
make scalar computation faster? People might switch to CPython and not know
they should have been using numpy all along...

It's considered worth the cost, since it dramatically improves the
performance of common naive code in a way that doesn't alter the semantics.

Seems the same argument could be made for sum(list_of_strings).

 > It seems pretty pedantic to say: we could make this work well, but we'd
> rather chide you for not knowing the "proper" way to do it.

Yes, that's exactly what this is - a nudge towards the right way to
concatenate strings without incurring quadratic behaviour.

But if it were optimized, it wouldn't incur quadratic behavior.

We *want* people to learn that distinction, not sweep it under the rug.

But sum() is not inherently quadratic -- that's a limitation of the
implementation. I agree that disallowing it is a good idea given that
behavior, but if it were optimized, there would be no reason to steer
people away.
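
To illustrate, here's a hypothetical pure-Python sketch of a string-aware sum
(not how the builtin is actually written) that stays linear:

    # Hypothetical sketch only: gather the pieces and join once at the end,
    # so the total work is proportional to the length of the output.
    def sum_strings(strings, start=""):
        chunks = [start]
        chunks.extend(strings)      # assumes every element is a str
        return "".join(chunks)

    sum_strings(["a", "b", "c"])    # 'abc', in linear time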

"".join _could_ be naively written with the same poor performance -- why
should users need to understand why one was optimized and one was not?
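
For example, a naive pure-Python join has exactly the quadratic behaviour
sum() gets blamed for; the builtin avoids it by sizing the result up front
and copying each piece once:

    # Naive join, for illustration only: rebuilding the result on every pass
    # copies all the accumulated text again, so it's O(n**2) overall.
    def naive_join(sep, items):
        result = ""
        for i, item in enumerate(items):
            result = result + (sep if i else "") + item
        return result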

That's the other reason the implicit optimisation is controversial - it
hides an important difference in algorithmic complexity from users.

It doesn't hide it -- it eliminates it. I suppose it's good for folks to
understand the implications of string immutability for when they write
their own algorithms, but this wouldn't be considered a good argument for a
poorly performing sort() for instance.

> Practicality beats purity?

Teaching users the difference between linear time operations and quadratic
ones isn't about purity, it's about passing along a fundamental principle
of algorithm scalability.

That is a very important lesson to learn, sure, but Python is not only a
teaching language. People will need to learn those lessons at some point;
this one feature makes little difference.

We do it specifically for strings because they *do* have an optimised
algorithm available that we can point users towards, and concatenating
multiple strings is common.

Sure, but I think all that does is teach people about a CPython-specific
implementation detail -- and I doubt naive users get any closer to
understanding algorithmic complexity -- all they learn is that you should
use "".join().

Oh well, not really that big a deal.

-Chris