StringType add operation seems every inefficient
Jeremy Hylton
jeremy at alum.mit.edu
Tue Feb 13 10:00:14 EST 2001
>>>>> "RL" == Lee, Rick <rickylee at americasm01.nt.com> writes:
RL> If I do something like this:
RL> s = ''
RL> for i in listOfStrings:
RL> s = s + i
RL> Once listOfStrings is say 10000 members of 100 bytes, it takes
RL> forever (ie. minutes) for the "for" loop to finish. This is
RL> running on a 700MHz WinNT machine, Python 2.0
RL> Contrast with:
RL> s = ''
RL> blocks = []
RL> for i in listOfStrings:
RL> s = s + i if len(s) > 4000
RL> blocks.append(s); s = ''
RL> This takes a blink of the eye to finish, so it is faster than
RL> the first piece of code by between 100 and 1000 times.
RL> Is this relative inefficiency in the string add operation
RL> expected?
It is expected, because the string add version is doing something
quite different than accumulator version. The 's = s + i' line is
making a new string for each iteration of the loop. To make the new
string, the interpreter allocates a new string object and copies the
contents of s and i into it.
It may help you to see what is happening if you unroll the loop by
hand:
s0 = ''
s1 = s0 + listOfStrings[0]
s2 = s1 + listOfStrings[1]
# ~= s0 + listOfStrings[0] + listOfStrings[1]
s3 = s2 + listOfStrings[2]
# ~= s0 + listOfStrings[0] + listOfStrings[1] + listOfStrings[2]
In the end, that's O(N) string allocations and O(N**2) string copies,
where N is the length of listOfStrings.
RL> I eventually have to produce a big long string of approx. 10MB.
RL> What is the fastest way to do it without having to write
RL> C-extensions?
The second technique you show -- using a list to accumulate the
individual strings -- is a standard Python idiom. At the end, you do
"".join(blocks) or somesuch to produce a single string. This approach
does exactly one copy for each constituent string.
I believe someone also suggested using cStringIO, which is probably
just as fast.
Jeremy
More information about the Python-list
mailing list