[Tutor] Re: concat vs join - followup

Andrei project5 at redrival.net
Sun Aug 29 00:03:23 CEST 2004


Kent Johnson wrote on Sat, 28 Aug 2004 00:02:06 -0400:

> A couple of people have made good points about my post comparing string 
> concatenation and join.
<snip>
> Based on this experiment I would say that if the total number of characters 
> is less than 500-1000, concatenation is fine. For anything bigger, use join.

I wouldn't build a list of strings just in order to use join on a
handful of strings, along the lines of "a" + "b". Join is preferred over
string concatenation for reasons of speed in a different circumstance,
which I'll discuss below.

I will argue that in the majority of cases neither string concatenation
nor join is the best solution. And I think that in real-life cases
concatenation will virtually always be beaten by either join or string
formatting on grounds of both speed and readability.

An important aspect of concatenation is that in many cases it is not the
best way to solve the problem:

1. it might often seem useful for appending a filename to a path, but
os.path.join is more likely the right solution in that case.

2. it might seem useful for 'formatting' output, in which case format
strings are the best (clearest and most maintenance-friendly) solution.
This holds even though, for really simple cases, string formatting is
slightly slower than concatenation (~10%):
>>> import timeit
>>> w = timeit.Timer('s, t = "John", "Smith"\nu = "Name: %s %s" % (s, t)')
>>> min(w.repeat(100, 10000))
0.010369194967552176
>>> x = timeit.Timer('s, t = "John", "Smith"\nu = "Name: " + s + " " + t')
>>> min(x.repeat(100, 10000))
0.0091939059293508762

For more difficult cases string formatting beats concatenation hands-down:

>>> y = timeit.Timer('s, t, u, v, w = "John", "Smith", 45, 2, 1973\nz = "Name: %s %s\\nAge: %s\\nKids: %s\\nYear of birth: %s" % (s, t, u, v, w)')
>>> min(y.repeat(50, 1000))
0.0032283432674375945
>>> z = timeit.Timer('s, t, u, v, w = "John", "Smith", 45, 2, 1973\nz = "Name: "+s+" "+t+"\\nAge: "+str(u)+"\\nKids: "+str(v)+"\\nYear of birth: "+str(w)')
>>> min(z.repeat(50, 1000))
0.0055814356292103184

3. given some arbitrary list of strings, people might be tempted to append
them to each other in a loop, where a join is simply clearer.
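
The three alternatives listed above can be put side by side in a small
sketch. The sample values (directory, filename, first, last, age, names)
are made up purely for illustration and do not come from the timings in
this post:

```python
import os.path

# Hypothetical sample values, chosen only for illustration.
directory, filename = "data", "report.txt"
first, last, age = "John", "Smith", 45
names = ["Smith", "Jones", "Brown"]

# 1. Paths: os.path.join inserts the platform's separator for us.
path = os.path.join(directory, filename)

# 2. Output: one format string instead of a chain of + and str().
line = "Name: %s %s, age: %s" % (first, last, age)

# 3. Existing list: a single join instead of a += loop.
joined = ", ".join(names)

print(path)
print(line)
print(joined)
```

In none of the three cases is "+" needed, and each line says what it is
doing more directly than the equivalent concatenation would.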

The huge speed advantage of join() shows up because strings quite often
already come in lists. If I ask my storage system for all last names,
I'll get a list of strings. os.listdir gives a list of filenames.
readlines() gives a list. split() gives a list. Etc. This brings us to the
third point above: we have to separate the cost of joining an existing
list of strings from the time consumed in building one:

>>> unit = "0123456789"
>>> def timeTwo(fn, count):
...     setup = "from __main__ import %s, unit" % fn.__name__
...     stmt = "%s(%d)" % (fn.__name__, count)
...     t = timeit.Timer(stmt, setup)
...     secs = min(t.repeat(50, 1000))
...     return secs
...
>>> def pureAppend(count):
...     s = []
...     for i in range(count):
...         s.append(unit)
...     return s
...     
>>> def appendAndJoin(count):
...     s = []
...     for i in range(count):
...         s.append(unit)
...     return "".join(s)
...     
>>> def pureAdd(count):
...     s = ""
...     for i in range(count):
...         s += unit
...     return s
...     
>>> unit = "0123456789"
>>> timeTwo(pureAppend, 500)
0.32285667591827405
>>> timeTwo(appendAndJoin, 500)
0.36274442701517273
>>> timeTwo(pureAdd, 500)
0.94211862122165257
>>> timeTwo(pureAppend, 10)
0.0086550106228742152
>>> timeTwo(appendAndJoin, 10)
0.009797893307677441
>>> timeTwo(pureAdd, 10)
0.0062792896860628389

The join call is only a small part (about 10% in the timings above) of the
total execution time of appendAndJoin; it's the appending that consumes
the time. This means that join still beats string concatenation with two
hands tied behind its back if the list of strings already exists, even if
its items are few and short:

>>> def timeThree(fn):
...     setup = "from __main__ import %s, l" % fn.__name__
...     stmt = "%s()" % fn.__name__
...     t = timeit.Timer(stmt, setup)
...     secs = min(t.repeat(50, 1000))
...     return secs
...     
>>> l = ["0123456789" for i in range(10)]
>>> def doJoin():
...     mylist = l
...     return "".join(mylist)
...     
>>> def doAdd():
...     mylist = l
...     s = ""
...     for i in range(len(mylist)):
...         s += mylist[i]
...     return s
...     
>>> timeThree(doJoin)
0.0016219938568156067
>>> timeThree(doAdd)
0.0067637087952334696
>>> l = ["012" for i in range(10)]
>>> timeThree(doJoin)
0.0014554922483966948
>>> timeThree(doAdd)
0.0064703754248967016
>>> l = ["012" for i in range(1000)]
>>> timeThree(doJoin)
0.055253111778256425
>>> timeThree(doAdd)
1.2762932160371747
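
For anyone who wants to repeat the last comparison outside the interactive
prompt, here is a minimal standalone sketch. It assumes an interpreter
where timeit.Timer accepts a callable (Python 2.6 or later); the names
do_join, do_add and best_time are my own for this sketch, and the numbers
it prints will differ from those above depending on machine and Python
version:

```python
import timeit

def do_join(strings):
    """Concatenate an existing list with a single join call."""
    return "".join(strings)

def do_add(strings):
    """Concatenate an existing list by repeated += in a loop."""
    s = ""
    for item in strings:
        s += item
    return s

def best_time(fn, strings, repeat=3, number=1000):
    """Return the fastest of `repeat` runs of `number` calls each."""
    timer = timeit.Timer(lambda: fn(strings))
    return min(timer.repeat(repeat, number))

for size in (10, 1000):
    strings = ["012"] * size
    # Both approaches must build the same string before timing them.
    assert do_join(strings) == do_add(strings)
    print(size, best_time(do_join, strings), best_time(do_add, strings))
```

Taking the minimum over several repeats, as in the transcripts above,
reduces the influence of whatever else the machine happens to be doing.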



-- 
Yours,

Andrei

=====
Real contact info (decode with rot13):
cebwrpg5 at jnanqbb.ay. Fcnz-serr! Cyrnfr qb abg hfr va choyvp cbfgf. V ernq
gur yvfg, fb gurer'f ab arrq gb PP.


