[pypy-dev] [Python-Dev] efficient string concatenation (yep, from 2004)

Nick Coghlan ncoghlan at gmail.com
Wed Feb 13 15:44:09 CET 2013


On Wed, Feb 13, 2013 at 10:06 PM, Christian Tismer <tismer at stackless.com> wrote:
> To avoid such hidden traps in larger code bases, documentation is
> needed that clearly gives a warning saying "don't do that", like CS
> students learn for most other languages.

How much more explicit do you want us to be?

"""6. CPython implementation detail: If s and t are both strings, some
Python implementations such as CPython can usually perform an in-place
optimization for assignments of the form s = s + t or s += t. When
applicable, this optimization makes quadratic run-time much less
likely. This optimization is both version and implementation
dependent. For performance sensitive code, it is preferable to use the
str.join() method which assures consistent linear concatenation
performance across versions and implementations."""

from http://docs.python.org/2/library/stdtypes.html#typesseq

So please don't blame us for people not reading a warning that is already there.

Since my rewrite of the sequence docs, Python 3 doesn't even
acknowledge the hack's existence and is quite explicit about what you
need to do to get reliably linear behaviour:

"""6. Concatenating immutable sequences always results in a new
object. This means that building up a sequence by repeated
concatenation will have a quadratic runtime cost in the total sequence
length. To get a linear runtime cost, you must switch to one of the
alternatives below:

    if concatenating str objects, you can build a list and use
str.join() at the end or else write to a io.StringIO instance and
retrieve its value when complete
    if concatenating bytes objects, you can similarly use bytes.join()
or io.BytesIO, or you can do in-place concatenation with a bytearray
object. bytearray objects are mutable and have an efficient
overallocation mechanism
    if concatenating tuple objects, extend a list instead
    for other types, investigate the relevant class documentation"""

from http://docs.python.org/3/library/stdtypes.html#common-sequence-operations

Deliberately *relying* on the += hack to avoid quadratic runtime is
just plain wrong, and our documentation already says so.

If anyone really thinks it will help, I can add a CPython
implementation note back in to the Python 3 docs as well, pointing out
that CPython performance measurements may hide broken algorithmic
complexity related to string concatenation, but the corresponding note
in Python 2 doesn't seem to have done much good :P

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


More information about the pypy-dev mailing list