[XML-SIG] New data on speed of string appending
tpassin@home.com
Wed, 23 Aug 2000 23:39:58 -0400
Remember back in June (June 22, 2000), when Bjorn Pettersen showed that using
StringIO instead of just appending to a string was **much** faster? Then
others showed that using list.append() followed by string.join(list) was also
very fast.
Well, a colleague wrote a Python script to convert CSV data to a particular
XML format. It took 17 seconds to run on my computer, even though the
output wasn't that long. He was producing the entire output string before
writing any of it. I changed all the str=str+char statements to use
list.append() instead. The script then ran almost instantaneously - well
under a second.
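The change amounted to something like this (a simplified sketch, not his
actual converter - the tag and function names here are made up):

    import string

    # Instead of  out = out + '<field>' + value + '</field>'  for every
    # field, collect the pieces in a list and join them once at the end.
    def row_to_xml(fields):
        pieces = ['<row>']
        for value in fields:
            pieces.append('<field>' + value + '</field>')
        pieces.append('</row>\n')
        return string.join(pieces, '')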
I decided to do a little checking, and I wanted to share the results. I
tried all three candidate methods:
1)
    str=''
    for char in data:
        str=str+char

2)
    l=[]
    for char in data:
        l.append(char)
    str=string.join(l,'')

3)
    st=cStringIO.StringIO()
    for char in data:
        st.write(char)
    str=st.getvalue()
The test program simply builds an output string equal to the input string
one character at a time. The times include the time to create the
string/list/cStringIO object.
The results are dramatic. Method 1) is as good as or better than anything
until the string length exceeds about 1000 bytes. Then Method 1 starts
slowing down. Above about 4000 bytes, it's really getting ssslllooowww.
Here is a table of the results on my system - 450 MHz PIII running Win98,
Python 1.5.2.
Rate of generating output string, char/sec

Input length (chars)   Method 1   Method 2   Method 3
         50-1,000        3.3e5      1.8e5      2.3e5
            1,200        3.2e5      1.8e5      2.6e5
            1,500        1.2e5      1.8e5      2.5e5
            2,000        1.2e5      2.7e5      2.6e5
            4,000        6.1e4      1.8e5      2.6e5
            8,000        3.6e4      1.9e5      2.5e5
           15,000        1.7e4      1.4e5      2.5e5
           30,000         8200      1.8e5      2.7e5
           40,000         6600      1.8e5      2.4e5
           60,000         4500      2.1e5      2.2e5
          100,000          ---      1.8e5      2.4e5
          200,000          ---      1.8e5      2.4e5
These figures include some averaging. The few numbers that are a little
different - like Method 2 at 60,000 char - probably don't mean anything.
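The slowdown in Method 1 is just what you'd expect if each str=str+char has
to copy the whole string built so far: producing n characters then copies
roughly 1+2+...+n = n*(n+1)/2 bytes in total, so the char/sec rate should
fall off roughly linearly with length once the copying dominates. The
measurements fit that - going from 30,000 to 60,000 characters roughly
halves the rate (8200 down to 4500 char/sec). A quick sanity check of the
arithmetic (my own figuring, not more measurements):

    # Bytes copied if every append copies the whole string so far:
    # 1 + 2 + ... + n = n*(n+1)/2 (computed in floating point to keep
    # the numbers small enough for 1.5.2's plain ints).
    for n in (15000, 30000, 60000):
        copies = n * (n + 1.0) / 2.0
        print '%6d chars -> roughly %.1e bytes copied' % (n, copies)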
Oh, yes, plain StringIO was definitely slower than cStringIO, as you might
expect - I don't have any figures for it, though.
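If you want code that still runs where cStringIO isn't available, the usual
idiom (not something I timed) is to fall back to the pure-Python module:

    # Prefer the C implementation; fall back to the pure-Python one.
    try:
        import cStringIO
        StringIO = cStringIO.StringIO
    except ImportError:
        import StringIO
        StringIO = StringIO.StringIO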
Below is the test program, followed by parts of Bjorn's post.
Cheers,
Tom Passin
========================================================
# Compare three ways of building a long output string one character at a time.
import time
import string
import cStringIO
import sys

def Min(x,y):
    if x>=y: return y
    return x

if __name__=='__main__':
    args=sys.argv
    if len(args)>1:
        file=args[1]
    else:
        print "No file to test..."
        sys.exit(0)
    f=open(file)
    data=f.read()
    f.close()

    # Use this variable to set the effective length of the test.
    # Or, just make it larger than the input file size.
    testlength=300000
    data=data[:testlength]
    datalen=len(data)   # Input might have been shorter than testlength
    print "\nTesting Speed of adding characters to strings"
    print 'Data consists of %s characters' % datalen

    # String appending takes too long for long string sizes,
    # so it's better to force the data length to some maximum value.
    # We don't force the data size for the other methods (below).
    d1len=Min(40000,datalen)
    d1=data[:d1len]
    d1len=len(d1)
    start=time.time()
    reps=1
    # Set up multiple reps to get a usable test duration
    if d1len<5000:
        reps=int(20000/d1len)
    for n in range(reps):
        s=''
        for c in d1:
            s=s+c               # Method 1
    duration=time.time()-start
    print '\tAdd-to-string for %s characters: %.3g seconds (%.3g char/sec)'\
          % (d1len,duration/reps,(reps*d1len/duration))
    d1=s=None

    print 'Times for %s characters:' % datalen

    # Append-to-list method
    start=time.time()
    reps=1
    if datalen<10000:
        reps=int(100000/datalen)
    for n in range(reps):
        l=[]
        for c in data:
            l.append(c)         # Method 2
        s=string.join(l,'')
    duration=time.time()-start
    print '\tAppend-to-list: %.3g sec (%.3g char/sec)' \
          % (duration/reps,(reps*datalen)/duration)
    l=[]

    # cStringIO method
    start=time.time()
    reps=1
    if datalen<10000:
        reps=int(100000/datalen)
    for n in range(reps):
        st=cStringIO.StringIO()
        for c in data:
            st.write(c)         # Method 3
        s=st.getvalue()
        st.close()
    duration=time.time()-start
    print '\tcStringIO: %.3g sec (%.3g char/sec)'\
          % (duration/reps,(reps*datalen)/duration)
    st=[]; s=''
==================================================
----- Original Message -----
From: "Bjorn Pettersen" <bjorn@roguewave.com>
To: "Lars Marius Garshol" <larsga@garshol.priv.no>
Cc: <xml-sig@python.org>
Sent: Monday, June 05, 2000 8:03 PM
Subject: Re: [XML-SIG] speed question re DOM parsing
> Lars Marius Garshol wrote:
> >
> > * Bjorn Pettersen
> > |
> > | Question: does using StringIO (or perhaps array) and __getattr__
> > | sound like the right thing to do?
> >
> > StringIO sounds like the right thing, at least for that particular
> > document. Probably it wouldn't be too bad for the other documents
> > either, but I have no experience with its performance.
> >
> > I'm afraid I don't have the necessary context to answer the
> > __getattr__ questions, but: I would definitely like to see your
> > sources. If you could post them somewhere, I, at least, would be happy
> > to have a look at them.
>
> I've included the patched file as an attachment. My changes are
> confined to:
>
> - importing (c)StringIO at the top
> - changing the constructor call to _element (line 82) to pass
> a StringIO object rather than an empty string.
> - hiding the "first_cdata" member in the __init__ method of _element
> - adding a __getattr__ method to _element.
>
> With limited performance testing I got:
>
> File Size    Original    Patched
>   37K          0.14s       0.07s
>  968K        103.77s       1.68s
>
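P.S. For anyone who didn't follow the earlier thread, the pattern Bjorn
describes looks roughly like this (my own sketch with made-up method names,
not his actual patch):

    import cStringIO

    class _element:
        def __init__(self):
            # Character data accumulates in a cStringIO buffer instead of
            # being concatenated onto a plain string.
            self._cdata_buf = cStringIO.StringIO()

        def add_cdata(self, text):          # hypothetical helper name
            self._cdata_buf.write(text)     # cheap write() instead of str=str+text

        def __getattr__(self, name):
            # Code that reads elem.first_cdata still gets an ordinary
            # string; the buffer stays hidden behind __getattr__.
            if name == 'first_cdata':
                return self._cdata_buf.getvalue()
            raise AttributeError, name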