[Tutor] Joining all strings in stringList into one string
Steven D'Aprano
steve at pearwood.info
Sat Jun 2 16:29:07 CEST 2012
Jordan wrote:
> #Another version might look like this:
>
> def join_strings2(string_list):
> final_string = ''
> for string in string_list:
> final_string += string
> print(final_string)
> return final_string
Please don't do that. This risks becoming slow. REALLY slow. Painfully slow.
Like, 10 minutes versus 3 seconds slow. Seriously.
The reason for this is quite technical, and the reason why you might not
notice is even more complicated, but the short version is this:
Never build up a long string by repeated concatenation of short strings.
Always build up a list of substrings first, then use the join method to
assemble them into one long string.
Why repeated concatenation is slow:
Suppose you want to concatenation two strings, "hello" and "world", to make a
new string "helloworld". What happens?
Firstly, Python has to count the number of characters needed, which
fortunately is fast in Python, so we can ignore it. In this case, we need 5+5
= 10 characters.
Secondly, Python sets aside enough memory for those 10 characters, plus a
little bit of overhead: ----------
Then it copies the characters from "hello" into the new area: hello-----
followed by the characters of "world": helloworld
and now it is done. Simple, right? Concatenating two strings is pretty fast.
You can't get much faster.
Ah, but what happens if you do it *repeatedly*? Suppose we have SIX strings we
want to concatenate, in a loop:
words = ['hello', 'world', 'foo', 'bar', 'spam', 'ham']
result = ''
for word in words:
result = result + word
How much work does Python have to do?
Step one: add '' + 'hello', giving result 'hello'
Python needs to copy 0+5 = 5 characters.
Step two: add 'hello' + 'world', giving result 'helloworld'
Python needs to copy 5+5 = 10 characters, as shown above.
Step three: add 'helloworld' + 'foo', giving 'helloworldfoo'
Python needs to copy 10+3 = 13 characters.
Step four: add 'helloworldfoo' + 'bar', giving 'helloworldfoobar'
Python needs to copy 13+3 = 16 characters.
Step five: add 'helloworldfoobar' + 'spam', giving 'helloworldfoobarspam'
Python needs to copy 16+4 = 20 characters.
Step six: add 'helloworldfoobarspam' + 'ham', giving 'helloworldfoobarspamham'
Python needs to copy 20+3 = 23 characters.
So in total, Python has to copy 5+10+13+16+20+23 = 87 characters, just to
build up a 23 character string. And as the number of loops increases, the
amount of extra work needed just keeps expanding. Even though a single string
concatenation is fast, repeated concatenation is painfully SLOW.
In comparison, ''.join(words) one copies each substring once: it counts out
that it needs 23 characters, allocates space for 23 characters, then copies
each substring into the right place instead of making a whole lot of temporary
strings and redundant copying.
So, join() is much faster than repeated concatenation. But you may never have
noticed. Why not?
Well, for starters, for small enough pieces of data, everything is fast. The
difference between copying 87 characters (the slow way) and 23 characters (the
fast way) is trivial.
But more importantly, some years ago (Python 2.4, about 8 years ago?) the
Python developers found a really neat trick that they can do to optimize
string concatenation so it doesn't need to repeatedly copy characters over and
over and over again. I won't go into details, but the thing is, this trick
works well enough that repeated concatenation is about as fast as the join
method MOST of the time.
Except when it fails. Because it is a trick, it doesn't always work. And when
it does fail, your repeated string concatenation code will suddenly drop from
running in 0.1 milliseconds to a full second or two; or worse, from 20 seconds
to over an hour. (Potentially; the actual slow-down depends on the speed of
your computer, your operating system, how much memory you have, etc.)
Because this is a cunning trick, it doesn't always work, and when it doesn't
work, and you have slow code and no hint as to why.
What can cause it to fail?
- Old versions of Python, before 2.4, will be slow.
- Other implementations of Python, such as Jython and IronPython, will not
have the trick, and so will be slow.
- The trick is highly-dependent on internal details of the memory management
of Python and the way it interacts with the operating system. So what's fast
under Linux may be slow under Windows, or the other way around.
- The trick is highly-dependent on specific circumstances to do with the
substrings being added. Without going into details, if those circumstances are
violated, you will have slow code.
- The trick only works when you are adding strings to the end of the new
string, not if you are building it up from the beginning.
So even though your function works, you can't rely on it being fast.
--
Steven
More information about the Tutor
mailing list