looking for speed-up ideas

Ram Bhamidipaty ramb at sonic.net
Tue Feb 4 01:48:57 EST 2003


Andrew Dalke <adalke at mindspring.com> writes:

> Ram Bhamidipaty wrote:
> > I have some python code that processes a large file. I want to see how
> > much faster this code can get. Mind you, I don't _need_ the code to go
> > faster - but it sure would be nice if it were faster...
> 
> Don't create the FileSize object.  Use a simple tuple instead.  With
> an object you have higher overheads to create the object and to make
> the comparison.
> 
> Try this.  I don't have heap so I do a sort and cut every once in a
> while.  It also doesn't do full error checking in case the input isn't
> in the right format.  And it uses a more recent version of Python than
> the code you have (eg, no need for xreadlines)
> 
> This should be quite fast.
> 
> ...

> 					Andrew
> 					dalke at dalkescientific.com




Wow. Your code is a lot nicer than mine! Here is what I got from the
code without the heapqc support:

300k lines in 33 seconds.

With the heapqc module: 300k lines in 34 seconds.

Hmm. Something is definitely wrong. Your script _should_ be faster.

Here is what it looks like after I added the heap module.
There seems to be some kind of bug with the heappop routine not
returning a tuple. The output I am getting looks like each
element of each tuple is being returned by heappop separately:

(This comes from the 300k-line version of my data file.)
ud_sim.o
-567988
ISBW000M1.s
0
ISBW000M2.m

...
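
One possible reading of the output above, offered as an assumption rather
than a diagnosis: the script's own comment "need to chop newline" suggests
the name field still ends with the newline from the input line. If so,
printing the name followed by the size would put the size on the following
line, which could look like tuple elements coming back one at a time. A
minimal sketch; the sample line and the "dir0" id are hypothetical, built
only from the values shown in the output:

```python
# Hypothetical input line in the "F/<size>/<name>" form the script splits on.
line = "F/567988/ud_sim.o\n"

ignore, size, name = line.split("/")
record = (-int(size), "dir0", name)  # "dir0" is a made-up directory id

# The name keeps the newline that ended the input line.
print(repr(record[2]))   # 'ud_sim.o\n'

# Stripping the line before splitting removes it.
ignore, size, clean_name = line.rstrip("\n").split("/")
print(repr(clean_name))  # 'ud_sim.o'
```

If this is what is happening, heappop is returning whole tuples and only
the printed layout is misleading.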


----------------------------------------------------------
#!/remote/espring/ramb/tools/bin/python

import sys, profile, imp

m = imp.find_module("heapqc", ["/remote/TMhome/ramb/src/Downloaded/py_heap"])
if not m:
    print "Unable to load heap module"
    sys.exit(0)
mod = imp.load_module("heapqc", m[0], m[1], m[2])
if not mod:
    print "module load failed"
import heapqc


def process(infile):
     dirid_info = {}

     line = infile.readline()
     assert line[:1] == "T"
     ignore, dirname, dirid = line.split()
     dirid_info[dirid] = (None, dirname)

     fileinfo = []

     for line in infile:
         if line[:1] == "F":
             ignore, size, name = line.split("/")
             # negate size so 'largest' is sorted first
             if len(fileinfo) < 200:
                 fileinfo.append( (-long(size), dirid, name) )
                 if len(fileinfo) == 200:
                     heapqc.heapify(fileinfo)
             else:
                 heapqc.heapreplace(fileinfo, (-long(size), dirid, name) )
         else:
             ignore, dirname, parent_id, dirid = line[:-1].split("/")
             dirid_info[dirid] = (parent_id, dirname)

     print "len fileinfo = ", len(fileinfo)
     while len(fileinfo) > 0:
         size, dirid, name = heapqc.heappop(fileinfo)
         print name, size

     sys.exit(0)

     for size, dirid, name in heapqc.heappop(fileinfo):
         size = -size
         components = [name[:-1]]  # need to chop newline
         while dirid != None:
             dirid, dirname = dirid_info[dirid]
             components.append(dirname)
         components.reverse()
         print size, "/".join(components)

f = file(sys.argv[1], "r")
profile.run("process(f)")
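
For comparison, a minimal sketch of keeping only the N largest entries with
a heap. It uses the standard library's heapq rather than the heapqc module
above, on the assumption that the two behave alike; keep_largest and the
sample data are hypothetical. heapq.heapreplace always pops the current
smallest item before pushing the new one, so the sketch stores positive
sizes and checks each new record against heap[0] first:

```python
import heapq

def keep_largest(records, limit):
    """Return the `limit` largest (size, name) tuples, largest first."""
    heap = []
    for record in records:
        if len(heap) < limit:
            heapq.heappush(heap, record)
        elif record > heap[0]:
            # heap[0] is the smallest record kept so far; swap it out
            # only when the new record is larger.
            heapq.heapreplace(heap, record)
    return sorted(heap, reverse=True)

# Made-up sample data, not taken from the real input file.
sample = [(10, "a.o"), (567988, "ud_sim.o"), (0, "ISBW000M1.s"),
          (42, "b.c"), (7, "c.h")]
top = keep_largest(sample, 3)
for size, name in top:
    print("%d %s" % (size, name))
```

With this layout the heap never holds more than `limit` entries, and most
input lines cost only one comparison against heap[0].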



