Help with script with performance problems
Peter Otten
__peter__ at web.de
Sun Nov 23 16:04:36 EST 2003
Dennis Roberts wrote:
> I have a script to parse a dns querylog and generate some statistics.
> For a 750MB file a perl script using the same methods (splits) can
> parse the file in 3 minutes. My python script takes 25 minutes. It
> is enough of a difference that unless I can figure out what I did
> wrong or a better way of doing it I might not be able to use python
> (since most of what I do is parsing various logs). The main reason to
> try python is I had to look at some early scripts I wrote in perl and
> had no idea what the hell I was thinking or what the script even did!
> After some googling and reading Eric Raymonds essay on python I jumped
> in:) Here is my script. I am looking for constructive comments -
> please don't bash my newbie code.
Below is my version of your script. It tries to use more idiomatic Python
and is about 20% faster on some bogus data, but that comes nowhere near
closing the performance gap to the Perl script that you report.
However, it took 143 seconds to process 10**7 lines generated by
<makesample.py>
import itertools, sys

sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
thousand = itertools.cycle(range(1000))
hundred = itertools.cycle(range(100))

out = file(sys.argv[1], "w")
try:
    try:
        count = int(sys.argv[2])
    except IndexError:
        count = 10**7
    for i in range(count):
        print >> out, sample % (i, thousand.next(), hundred.next())
finally:
    out.close()
</makesample.py>
with Python 2.3.2 on my 2.6GHz P4. Going by your 25:3 ratio, would that mean
Perl would do it in 17 seconds? Anyway, the performance problem is more likely
your computer :-) Python should be fast enough for the purpose.
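
For the record, the 17-second guess is just the timings quoted above scaled by each other. A quick sketch of the arithmetic, assuming the 25-minute vs. 3-minute ratio would carry over unchanged to my machine and my bogus data (which is far from certain); the variable names are made up for illustration:

```python
# Timings quoted in this thread.
python_minutes = 25.0      # reported Python run on the 750MB log
perl_minutes = 3.0         # reported Perl run on the same log
my_python_seconds = 143.0  # my run over 10**7 generated lines

# Assumption: Perl keeps the same speed advantage on my data and machine.
speed_ratio = python_minutes / perl_minutes
estimated_perl_seconds = my_python_seconds / speed_ratio

print("ratio: %.1f, estimated Perl time: %.0f seconds"
      % (speed_ratio, estimated_perl_seconds))
```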
Peter
<parselog.py>
#!/usr/bin/python -u
#Warning, not seriously tested
import sys

#import time
#starttime = time.time()

clients = {}
queries = {}
lineNo = -1      # so that "lineNo+1" below reports 0 for an empty file

threshold = 100
pointmod = 100000

f = file(sys.argv[1])
try:
    print "Each dot is %d lines..." % pointmod
    for lineNo, line in enumerate(f):
        if lineNo % pointmod == 0:
            sys.stdout.write(".")

        try:
            (month, day, timestr, stype, source,
             qtype, query, ctype, record) = line.split()
        except ValueError:
            raise Exception("problem splitting line %d\n%s" % (lineNo,
                                                               line))

        # keep only the part of the source before the '#'
        source = source.split('#', 1)[0]

        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
finally:
    f.close()

print
print lineNo+1, "lines processed"

for numclient, count in clients.iteritems():
    if count > threshold:
        print "%s,%s" % (numclient, count)

for numquery, count in queries.iteritems():
    if count > threshold:
        print "%s,%s" % (numquery, count)

#print "time:", time.time() - starttime
</parselog.py>
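
If you want to check the counting logic on a handful of lines before running it over the whole log, the core of the loop can be pulled into a function. This is only a sketch, not part of the script above: the name count_log and its threshold parameter are made up for illustration, it raises ValueError instead of a bare Exception, and it sticks to syntax that works in both Python 2 and Python 3.

```python
def count_log(lines, threshold=100):
    """Count lines per client and per query.

    Returns two lists of (name, count) pairs, one for clients and one
    for queries, keeping only counts above the threshold.
    """
    clients = {}
    queries = {}
    for line_no, line in enumerate(lines):
        fields = line.split()
        if len(fields) != 9:
            raise ValueError("problem splitting line %d\n%s" % (line_no, line))
        # Field 4 is the source, field 6 the query, as in parselog.py.
        source = fields[4].split('#', 1)[0]
        query = fields[6]
        clients[source] = clients.get(source, 0) + 1
        queries[query] = queries.get(query, 0) + 1
    busy_clients = [(k, v) for k, v in clients.items() if v > threshold]
    busy_queries = [(k, v) for k, v in queries.items() if v > threshold]
    return busy_clients, busy_queries


# Usage on a few lines in the makesample.py format.
sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
lines = [sample % (i, i % 1000, i % 100) for i in range(300)]
busy_clients, busy_queries = count_log(lines, threshold=2)
for name, count in busy_queries[:3]:
    print("%s,%s" % (name, count))
```

Passing an open file instead of a list works the same way, since the function only iterates over its argument.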