String concatenation performance with +=

Steven D'Aprano steve at pearwood.info
Sat Feb 14 07:33:21 CET 2009


Sammo wrote:

> Okay, this is what I have tried for string concatenation:
> 
> 1. Using += implemented using simple operations (12 ms)
> 2. Using += implemented inside a class (4000+ ms)
> 3. Using "".join implemented using simple operations (4000+ ms)
> 4. Using "".join implemented inside a class (4000+ ms)


I think you're doing more work than you need to. Here are my results:

>>> def join_test():
...     L = []
...     block = "abcdefghijklmnopqrstuvwxyz"*1000
...     for i in xrange(1000):
...             L.append(block)
...     return ''.join(L)
... 
>>> def concat_test():
...     s = ""
...     block = "abcdefghijklmnopqrstuvwxyz"*1000
...     for i in xrange(1000):
...             s += block
...     return s
... 
>>> assert join_test() == concat_test()
>>>
>>> from timeit import Timer
>>> Timer('concat_test()', 
... 'from __main__ import concat_test').repeat(number=100) 
[5.397799015045166, 4.3106229305267334, 4.3492171764373779]
>>> Timer('join_test()', 
... 'from __main__ import join_test').repeat(number=100) 
[4.5378072261810303, 3.9887290000915527, 3.919511079788208]


The join() idiom is slightly faster than repeated concatenation, even with
CPython's in-place += optimization. If you're not seeing that result, you
need to consider what unnecessary extra work you're doing.
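For reference, here is a sketch of the same pair of tests in modern Python 3 syntax (range and print() instead of xrange and the print statement; the timings shown above are from the original Python 2 session and will differ on current interpreters):

```python
from timeit import Timer

def join_test():
    # Accumulate blocks in a list, join once at the end.
    L = []
    block = "abcdefghijklmnopqrstuvwxyz" * 1000
    for i in range(1000):
        L.append(block)
    return ''.join(L)

def concat_test():
    # Repeated += on a str; CPython can sometimes resize in place.
    s = ""
    block = "abcdefghijklmnopqrstuvwxyz" * 1000
    for i in range(1000):
        s += block
    return s

# Both strategies build the same 26,000,000-character string.
assert join_test() == concat_test()

if __name__ == "__main__":
    # Best-of-three timings, scaled down from the original session.
    print(min(Timer('concat_test()', globals=globals()).repeat(repeat=3, number=10)))
    print(min(Timer('join_test()', globals=globals()).repeat(repeat=3, number=10)))
```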


>> Then why not just mutate the list and then call "".join?
> 
> AFAIK, using list mutation and "".join only improves performance if
> the "".join is executed outside of the loop. 

Naturally. If you needlessly join over and over again, instead of delaying
until the end, then you might as well do string concatenation without the
optimization.

join() isn't some magical function that makes your bugs go away. You have to
use it sensibly. What you've done is a variant of Shlemiel the
road-painter's algorithm:

http://www.joelonsoftware.com/articles/fog0000000319.html



Perhaps you have to repeatedly do work on the *temporary* results between
loop iterations? Something like this toy example?

s = ""
block = "abcdefghijklmnopqrstuvwxyz"*1000
for i in xrange(1000):
    s += block
    # do something with the partially concatenated string
    print "partial length is", len(s)
# all finished
do_something(s)

You've written that using join:

L = []
block = "abcdefghijklmnopqrstuvwxyz"*1000
for i in xrange(1000):
    L.append(block)
    # do something with the partially concatenated string
    L = [ ''.join(L) ]
    print "partial length is", len(L[0])
# all finished
s = ''.join(L)
do_something(s)

Okay, but we can re-write that more sensibly:

L = []
block = "abcdefghijklmnopqrstuvwxyz"*1000
for i in xrange(1000):
    L.append(block)
    # do something with the partially concatenated string
    print "partial length is", sum(len(item) for item in L)
# all finished
s = ''.join(L)
do_something(s)

There's still a Shlemiel algorithm in there, but it's executed in fast C
instead of slow Python, and it doesn't involve copying large blocks of
memory, only walking them, so you won't notice it as much. Can we be smarter?

L = []
block = "abcdefghijklmnopqrstuvwxyz"*1000
partial_length = 0
for i in xrange(1000):
    L.append(block)
    partial_length += len(block)
    # do something with the partially concatenated string
    print "partial length is", partial_length
# all finished
s = ''.join(L)
do_something(s)


Naturally this is a toy example, but I think it illustrates one technique
for avoiding turning an O(N) algorithm into an O(N**2) algorithm. 
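Here is that final version packaged as a Python 3 function (the print and do_something calls dropped, and the running total checked against the joined result at the end); the names are mine, not from the post above:

```python
def build(n_blocks=1000):
    # Build one big string from n_blocks copies of a 26,000-char block,
    # tracking the length with O(1) bookkeeping instead of re-joining.
    block = "abcdefghijklmnopqrstuvwxyz" * 1000
    L = []
    partial_length = 0
    for i in range(n_blocks):
        L.append(block)
        partial_length += len(block)
        # ... do something with partial_length here ...
    s = ''.join(L)  # single O(N) join at the very end
    assert partial_length == len(s)
    return s
```

The point of the running total is that any per-iteration work consults `partial_length` in constant time, so the whole build stays O(N).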

Good luck!


-- 
Steven
