O(n^2) is bad - can it be fixed?

Tue May 22 03:31:42 EDT 2001

"Helen Dawson" <helend at accessone.com> wrote in message
news:3B09FBF9.8691EBFE at accessone.com...
> The following script tests appending to a list and to a string.
>
> def AppendTest(listorstring):
>     for i in range(0, 100):
>         for x in range(0, 2000):
>             listorstring += " "
>         print i
>
> AppendTest([])
> AppendTest("")
>
> The test with the list runs very swiftly and smoothly - the numbers
> are printed out at a steady rate. The test with the string runs
> slowly, and the speed that the numbers appear slows down - a lot. In
> fact, I have never seen the string test end - I always abort it out
> of boredom before it gets to twenty five.
>
> Without even looking in the Python internals it is obvious what is
> happening. In the string case, each time a character is appended the
> entire string is copied. Therefore adding n characters requires

Right.  String objects are immutable, so it couldn't possibly
be otherwise.

> n^2/2 bytes to be copied, making processing large files (where large
> is > about 10K) very inefficient - and processing files larger
> than about 100K is virtually impossible.
>
> List does much better. Why? It probably does the same thing that the
> STL vector typically does - the allocabe space doubles whenever it
> runs out of room, and then the newly allocated space is filled in
> as needed.

Not quite -- Python's sources are pretty clear about that:-).  A
list DOES allocate a few extra entries when it grows, but it does
NOT grow geometrically.  So IN THEORY appending should be O(N^2)
for lists as well -- but in practice it seems N never gets big
enough to actually measure that effect.

> So, why can't string do that?

A string object is immutable.  So, what would it do with any
extra allocated space?

> When we could only go string = string + " " then fixing this problem
> was probably impossible, but with += it seems like it should be quite
> possible - maybe even easy!

Since a string object is immutable, it doesn't define "in-place
addition" -- thus the semantics of the two forms are identical.

Python's engine *MIGHT* perhaps special-case the occasions where
a string object is not interned AND there is just one reference
to it, but currently the engine doesn't do ANY "under the covers
black magic optimization", even for much more relevant issues
such as calling methods on objects, &c.  The lack of any tricky
optimization keeps the engine simple, lightweight, etc, etc.

> Right now if I am producing a text file a character at a time I have
> to do it to a list and then use string.join() to convert at the end.
> That works, but seems clunky.

Clunkiness is in the eye of the beholder.  Others would consider
"mutable strings" as clunkier, and Java's "solution" of having
separate "string" (immutable) and "string buffer" (mutable) types,
while feasible, hasn't won it great praise either.

> It's also a landmine for the naive. I recognized the problem
> immediately, but many people would simply assume that Python was slow,
> when it took an hour to process a one Mbyte file.

Education seems a better answer to that than black-magic
under-the-cover optimizations that would break as soon as
one had a second reference extant to a string object:-).

> I'm using Python 2.1, on Windows.
>
> Bruce Dawson.
>
> P.S. Is this a known issue?

Yes.  http://www.python.org/doc/essays/list2str.html mentions it,
for example, as does FAQ 4.7 (http://www.python.org/doc/FAQ.html).

Alex