[XML-SIG] New data on speed of string appending

tpassin@home.com tpassin@home.com
Wed, 23 Aug 2000 23:39:58 -0400


Remember back in June (June 22, 2000), when Bjorn Pettersen showed that using
StringIO instead of just appending to a string was **much** faster?  Then
others showed that using list.append() followed by string.join() was also
very fast.

Well, a colleague wrote a Python script to convert CSV data to a particular
XML format.  It took 17 seconds to run on my computer, even though the
output wasn't that long.  He was producing the entire output string before
writing any of it, building it up with str=str+char statements.  I changed
them all to use list.append().  The script then ran nearly instantaneously -
well under a second.

I decided to do a little checking, and I wanted to share the results.  I
tried all three candidate methods:

1)
s = ''
for char in data:
    s = s + char

2)
l = []
for char in data:
    l.append(char)
s = string.join(l, '')

3)
st = cStringIO.StringIO()
for char in data:
    st.write(char)
s = st.getvalue()

The test program simply builds an output string equal to the input string
one character at a time. The times include the time to create the
string/list/cStringIO object.

The results are dramatic.  Method 1 is as fast as or faster than the others
until the string length exceeds about 1000 bytes.  Then Method 1 starts
slowing down.  Above about 4000 bytes, it's really getting ssslllooowww.
Here is a table of the results on my system - a 450 MHz PIII running Win98
and Python 1.5.2.

      Rate of generating output string, char/sec

    Length of input (chars)   Method 1    Method 2    Method 3
          50-1,000              3.3e5       1.8e5       2.3e5
             1,200              3.2e5       1.8e5       2.6e5
             1,500              1.2e5       1.8e5       2.5e5
             2,000              1.2e5       2.7e5       2.6e5
             4,000              6.1e4       1.8e5       2.6e5
             8,000              3.6e4       1.9e5       2.5e5
            15,000              1.7e4       1.4e5       2.5e5
            30,000               8200       1.8e5       2.7e5
            40,000               6600       1.8e5       2.4e5
            60,000               4500       2.1e5       2.2e5
           100,000               ---        1.8e5       2.4e5
           200,000               ---        1.8e5       2.4e5
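
This shape is just what you'd expect if you remember that Python strings
are immutable: every s=s+c copies the entire string built so far, so
Method 1 does work proportional to the *square* of the output length,
while list.append() and cStringIO's write() cost roughly the same for
every character.  As a back-of-the-envelope check (my own cost model, not
part of the measurements):

# Rough cost model for Method 1 (my own sketch, not measured data):
# appending one character to an n-character immutable string copies
# about n+1 characters, so building an N-character string copies
# roughly N*(N+1)/2 characters in total - quadratic in N.
for N in (1000, 4000, 30000):
    copies = N * (N + 1) / 2
    print '%6d chars -> about %.2g character copies' % (N, copies)
# Doubling N roughly quadruples the copying, which matches the way
# Method 1's char/sec rate falls off in the table above.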

The figures in the table include some averaging.  The few numbers that are
a little out of line - like Method 2 at 60,000 characters - probably don't
mean anything.  Oh, yes, plain StringIO was definitely slower than
cStringIO, as you might expect - I don't have figures for it, though.
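
If you need a script to keep working where cStringIO isn't available, the
usual idiom (a general pattern, not something the test program below does)
is a guarded import:

# Standard fallback idiom: use the fast C implementation when it
# exists, and fall back to the pure-Python module otherwise.
try:
    import cStringIO
    StringIO = cStringIO.StringIO
except ImportError:
    import StringIO
    StringIO = StringIO.StringIO

buf = StringIO()            # same interface either way
buf.write('hello')
print buf.getvalue()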

Below is the test program, followed by parts of Bjorn's post.

Cheers,

Tom Passin
========================================================
import time
import string
import cStringIO
import sys

def Min(x, y):
    # (The built-in min() would do just as well.)
    if x >= y:
        return y
    return x

if __name__ == '__main__':
    args = sys.argv
    if len(args) > 1:
        file = args[1]
    else:
        print "No file to test..."
        sys.exit(0)
    f = open(file)
    data = f.read()
    f.close()

    # Use this variable to set the effective length of the test.
    # Or, just make it larger than the input file size.
    testlength = 300000
    data = data[:testlength]
    datalen = len(data)        # Input might have been shorter than testlength
    print "\nTesting speed of adding characters to strings"
    print 'Data consists of %s characters' % datalen

    # String appending takes too long for long strings,
    # so it's better to force the data length to some maximum value.
    # We don't force the data size for the other methods (below).
    d1len = Min(40000, datalen)
    d1 = data[:d1len]
    d1len = len(d1)
    start = time.time()
    reps = 1
    # Use multiple reps to get a usable test duration
    if d1len < 5000:
        reps = int(20000 / d1len)
    for n in range(reps):
        s = ''
        for c in d1:
            s = s + c                       # Method 1
    duration = time.time() - start
    print '\tAdd-to-string for %s characters: %.3g seconds (%.3g char/sec)' \
        % (d1len, duration / reps, (reps * d1len) / duration)
    d1 = s = None

    print 'Times for %s characters:' % datalen
    # Append-to-list method
    start = time.time()
    reps = 1
    if datalen < 10000:
        reps = int(100000 / datalen)
    for n in range(reps):
        l = []
        for c in data:
            l.append(c)                     # Method 2
        s = string.join(l, '')              # empty separator, so s == data
    duration = time.time() - start
    print '\tAppend-to-list: %.3g sec (%.3g char/sec)' \
        % (duration / reps, (reps * datalen) / duration)
    l = []

    # cStringIO method
    start = time.time()
    reps = 1
    if datalen < 10000:
        reps = int(100000 / datalen)
    for n in range(reps):
        st = cStringIO.StringIO()
        for c in data:
            st.write(c)                     # Method 3
        s = st.getvalue()
        st.close()
    duration = time.time() - start
    print '\tcStringIO: %.3g sec (%.3g char/sec)' \
        % (duration / reps, (reps * datalen) / duration)
    st = s = None                           # release the memory

==================================================
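
For reference, the trick Bjorn describes below looks roughly like the
sketch here.  His actual patch was sent as an attachment (not reproduced),
so the class layout and the add_text() method name are my own guesses; the
point is just the shape of the pattern: accumulate character data in a
cStringIO buffer, and use __getattr__ so old code that reads the
first_cdata string attribute keeps working.

# Hypothetical reconstruction of the pattern - NOT Bjorn's actual code.
import cStringIO

class _element:
    def __init__(self):
        # The buffer is "hidden" behind a leading underscore.
        self._first_cdata = cStringIO.StringIO()

    def add_text(self, text):              # method name is my invention
        self._first_cdata.write(text)      # cheap append, no recopying

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so reading
        # elem.first_cdata transparently returns the buffered string.
        if name == 'first_cdata':
            return self._first_cdata.getvalue()
        raise AttributeError, name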

----- Original Message -----
From: "Bjorn Pettersen" <bjorn@roguewave.com>
To: "Lars Marius Garshol" <larsga@garshol.priv.no>
Cc: <xml-sig@python.org>
Sent: Monday, June 05, 2000 8:03 PM
Subject: Re: [XML-SIG] speed question re DOM parsing


> Lars Marius Garshol wrote:
> >
> > * Bjorn Pettersen
> > |
> > | Question: does using StringIO (or perhaps array) and __getattr__
> > | sound like the right thing to do?
> >
> > StringIO sounds like the right thing, at least for that particular
> > document. Probably it wouldn't be too bad for the other documents
> > either, but I have no experience with its performance.
> >
> > I'm afraid I don't have the necessary context to answer the
> > __getattr__ questions, but: I would definitely like to see your
> > sources. If you could post them somewhere, I, at least, would be happy
> > to have a look at them.
>
> I've included the patched file as an attachment.  My changes are
> confined to:
>
>   - importing (c)StringIO at the top
>   - changing the constructor call to _element (line 82) to pass
>     a StringIO object rather than an empty string.
>   - hiding the "first_cdata" member in the __init__ method of _element
>   - adding a __getattr__ method to _element.
>
> With limited performance testing I got:
>
>   File Size  Original  Patched
>     37K        0.14s    0.07s
>    968K      103.77s    1.68s
>