[Python-Dev] file.readinto performance regression in Python 3.2 vs. 2.7?

Fri Nov 25 15:48:00 CET 2011

On 25 November 2011 11:37, Matt Joiner <anacrolix at gmail.com> wrote:
>> No "wtf" here, the read() loop is quadratic since you're building a
>> new, larger, bytes object every iteration.  Python 2 has a fragile
>> optimization for concatenation of strings, which can avoid the
>> quadratic behaviour on some systems (depends on realloc() being fast).
>
> Is there any way to bring back that optimization? a 30 to 100x slow
> down on probably one of the most common operations... string
> contatenation, is very noticeable. In python3.3, this is representing
> a 0.7s stall building a 10MB string. Python 2.7 did this in 0.007s.

It's a fundamental, but sadly not well-understood, consequence of
having immutable strings. Concatenating immutable strings in a loop is
quadratic. There are many ways of working around it (languages like C#
and Java have string builder classes, I believe, and in Python you can
use StringIO or build a list and join at the end) but that's as far as
it goes.

The optimisation mentioned was an attempt (by mutating an existing
string when the runtime determined that it was safe to do so) to hide
the consequences of this fact from end-users who didn't fully
understand the issues. It was relatively effective, but like any such
case (floating point is another common example) it did some level of
harm at the same time as it helped (by obscuring the issue further).

It would be nice to have the optimisation back if it's easy enough to
do so, for quick-and-dirty code, but it is not a good idea to rely on
it (and it's especially unwise to base benchmarks on it working :-))

Paul.