looking for speed-up ideas
Ram Bhamidipaty
ramb at sonic.net
Tue Feb 4 01:48:57 EST 2003
Andrew Dalke <adalke at mindspring.com> writes:
> Ram Bhamidipaty wrote:
> > I have some python code that processes a large file. I want to see how
> > much faster this code can get. Mind you, I don't _need_ the code to go
> > faster - but it sure would be nice if it were faster...
>
> Don't create the FileSize object. Use a simple tuple instead. With
> an object you have higher overheads to create the object and to make
> the comparison.
>
> Try this. I don't have heap so I do a sort and cut every once in a
> while. It also doesn't do full error checking in case the input isn't
> in the right format. And it uses a more recent version of Python than
> the code you have (e.g., no need for xreadlines).
>
> This should be quite fast.
>
> ...
> Andrew
> dalke at dalkescientific.com
Wow. Your code is a lot nicer than mine! Here is what I got from the
code without the heapqc support:
300k lines in 33 seconds.
With the heapqc module, 300k lines take 34 seconds.
Hmm. Something is definitely wrong. Your script _should_ be faster.
Here is what it looks like after I added the heap module:
There is some kind of bug with the heappop routine not
returning a tuple. The output I am getting looks like each
element of each tuple is being returned by heappop separately
(this comes from the 300k line version of my data file):
ud_sim.o
-567988
ISBW000M1.s
0
ISBW000M2.m
...
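One possible explanation for the output above, offered as an assumption rather than a confirmed diagnosis: name comes from line.split("/") on a line that still carries its trailing newline, so printing name followed by size would break each record across two lines even when heappop returns a whole tuple. The sketch below uses the standard library heapq module as a stand-in for heapqc (assumed to behave the same way), and the sample records and the "dir0" id are made up.

```python
import heapq

# Hypothetical sample lines in the "F/<size>/<name>" layout the script parses.
lines = ["F/567988/ud_sim.o\n", "F/0/ISBW000M1.s\n"]

heap = []
for line in lines:
    ignore, size, name = line.split("/")
    # name still ends with "\n" because the line was never stripped
    heapq.heappush(heap, (-int(size), "dir0", name))

entry = heapq.heappop(heap)          # one complete 3-tuple per pop
size, dirid, name = entry

raw_output = "%s %s" % (name, size)                  # newline lands mid-record
clean_output = "%s %s" % (name.rstrip("\n"), size)   # record stays on one line

print(repr(raw_output))
print(clean_output)
```

If this is what is happening, stripping the newline before printing (as the later name[:-1] in the script already does) would keep each record on one line.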
----------------------------------------------------------
#!/remote/espring/ramb/tools/bin/python
import sys, profile, imp

m = imp.find_module("heapqc", ["/remote/TMhome/ramb/src/Downloaded/py_heap"])
if not m:
    print "Unable to load heap module"
    sys.exit(0)
mod = imp.load_module("heapqc", m[0], m[1], m[2])
if not mod:
    print "module load failed"
import heapqc

def process(infile):
    dirid_info = {}
    line = infile.readline()
    assert line[:1] == "T"
    ignore, dirname, dirid = line.split()
    dirid_info[dirid] = (None, dirname)
    fileinfo = []
    for line in infile:
        if line[:1] == "F":
            ignore, size, name = line.split("/")
            # negate size so 'largest' is sorted first
            if len(fileinfo) < 200:
                fileinfo.append( (-long(size), dirid, name) )
                if len(fileinfo) == 200:
                    heapqc.heapify(fileinfo)
            else:
                heapqc.heapreplace(fileinfo, (-long(size), dirid, name) )
        else:
            ignore, dirname, parent_id, dirid = line[:-1].split("/")
            dirid_info[dirid] = (parent_id, dirname)
    print "len fileinfo = ", len(fileinfo)
    while len(fileinfo) > 0:
        size, dirid, name = heapqc.heappop(fileinfo)
        print name, size
    sys.exit(0)
    for size, dirid, name in heapqc.heappop(fileinfo):
        size = -size
        components = [name[:-1]] # need to chop newline
        while dirid != None:
            dirid, dirname = dirid_info[dirid]
            components.append(dirname)
        components.reverse()
        print size, "/".join(components)

f = file(sys.argv[1], "r")
profile.run("process(f)")
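
For comparison, here is a minimal sketch of keeping the N largest entries with a bounded heap, written against the standard library heapq module rather than heapqc. It is an illustration under stated assumptions, not code from this thread: largest_files, the sample records, and the limit of 3 are all made up. Unlike the script above, it stores un-negated sizes, so the smallest kept entry sits at the heap root and is the one evicted, and the root is replaced only when the new entry is larger.

```python
import heapq

def largest_files(records, limit):
    """Return the `limit` largest (size, name) records, biggest first.

    `records` is any iterable of (size, name) tuples. A min-heap of at most
    `limit` entries is kept, so heap[0] is always the smallest entry kept.
    """
    heap = []
    for record in records:
        if len(heap) < limit:
            heapq.heappush(heap, record)
        elif record > heap[0]:
            # New record beats the smallest kept one: swap it in.
            heapq.heapreplace(heap, record)
    return sorted(heap, reverse=True)

# Hypothetical sample data.
sample = [(10, "a.o"), (567988, "ud_sim.o"), (0, "ISBW000M1.s"),
          (42, "b.c"), (7, "c.h")]
top = largest_files(sample, 3)
print(top)
```

In Python versions that provide it, heapq.nlargest(limit, records) covers the same ground in a single call.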