Speed of string += string
Tim Peters
tim_one at email.msn.com
Sat Apr 12 21:22:13 EDT 2003
[Roy Smith]
> One of the very few things I like about C++ over Python is that the STL
> is very careful to document the complexity of the various operations.
> AFAICT, no such documentation exists for Python.
That's so. It will be repaired as soon as someone volunteers to write the
docs <wink>.
> True, it's documented that strings are immutable, and from that you
> should be able to infer that string addition requires lots of creating
> and destroying of objects, which sounds like it should work out to
> being quadratic, but that's not quite the same as having the
> algorithmic complexity spelled out for you.
Not the same at all. For example, string[i:j] in Python takes time
proportional to j-i, but that doesn't *follow* from immutability (there are
more complicated implementation schemes than Python uses that would allow
faster slicing -- in return for losing elsewhere).
> As another example, I can guess that Python dictionaries work like STL
> maps on the inside, but that would be just a guess. Is inserting an
> element logN?
In all versions of CPython released to date, expected time O(1), worst O(N),
and no amortized bound guaranteed better than O(N). If you supply a
particularly poor __hash__ implementation, or have spectacularly unlucky key
distributions, it can be expected-case O(N). Note that these don't count
the time for hashing the key. If, e.g., the key is a string s, and s isn't
interned, and s hasn't been hashed before, then hash(s) takes O(len(s)) time
by itself; but if s has been hashed before, hash(s) takes constant time (its
hash code is cached in the string object). So the full story can be quite
involved.
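A quick way to see the string-hash caching is a sketch like this (assuming CPython, where a str object remembers its hash code after the first call; the length and timing approach are just for illustration):

    import time

    s = "spam" * 5_000_000            # a long string that hasn't been hashed yet

    t0 = time.perf_counter()
    hash(s)                           # first call walks the whole string: O(len(s))
    t1 = time.perf_counter()
    hash(s)                           # hash code is now cached in the string object: O(1)
    t2 = time.perf_counter()

    print(f"first hash(s):  {t1 - t0:.6f}s")
    print(f"cached hash(s): {t2 - t1:.6f}s")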
> Is keys() linear?
Yes, and that's both best and worst case.
> Is del on an element logN? Is has_key() logN?
Both the same as insert.
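Here's a sketch of the "particularly poor __hash__" case mentioned above (the class is made up for illustration): when every key hashes to the same value, insert, lookup, and del all degrade to expected O(N), so building the dict below is quadratic overall.

    class AwfulKey:
        """Every instance hashes to the same value, so all keys collide."""
        def __init__(self, n):
            self.n = n
        def __hash__(self):
            return 42                 # pathologically poor __hash__
        def __eq__(self, other):
            return isinstance(other, AwfulKey) and self.n == other.n

    d = {}
    for i in range(2_000):
        d[AwfulKey(i)] = i            # each insert probes every earlier colliding key

    print(AwfulKey(1_999) in d)       # lookups (and dels) walk the same collision chain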
> Those would be my guesses. But they're just guesses.
Now you know. The *language* doesn't guarantee any of this, though, and
probably never will.
Guido worked on an earlier language (ABC) that used implementation
techniques with good worst-case bounds, chiefly balanced binary trees. But
the overhead of those schemes was a major drag on performance for most apps
most of the time. Python's implementation is generally more pragmatic,
aiming for simpler schemes with excellent normal-case performance, not
worrying much about worst-case performance.
For example, until Python 2.3, the worst-case performance of list.sort() was
quadratic (although in 1.5.2 that was extremely difficult to provoke even
with a deep understanding of the 1.5.2 samplesort's implementation; it was
easy to provoke in pre-1.5.2 quicksort variants).
Python 2.3 has a new sort with worst-case N log N behavior, but that isn't
why it was adopted: the 2.3 sort proved much faster than the
2.2 sort in many common cases where the list already has significant
pre-existing order, and proved no slower than the 2.2 sort when the input
data is randomly ordered. Worst-case analysis played no significant role in
deciding to switch to the new sort implementation, and I expect we would
have switched even if the new sort's worst-case performance were cubic (but
exceedingly rare).
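For the curious, here's a rough benchmark sketch (using today's sorted(), which still runs the 2.3 algorithm, timsort; the list size and the roughly 1% perturbation are arbitrary choices) showing how much pre-existing order helps:

    import random
    from timeit import timeit

    N = 100_000
    random_data = [random.random() for _ in range(N)]

    nearly_sorted = sorted(random_data)
    for _ in range(N // 100):                     # perturb about 1% of the positions
        i, j = random.randrange(N), random.randrange(N)
        nearly_sorted[i], nearly_sorted[j] = nearly_sorted[j], nearly_sorted[i]

    print("random order :", timeit(lambda: sorted(random_data), number=10))
    print("nearly sorted:", timeit(lambda: sorted(nearly_sorted), number=10))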