speed problems

Wed Jun 9 12:16:26 EDT 2004

In article <639f17f8.0406090728.4e4e0bc6 at posting.google.com>,
 nholtz at docuweb.ca (Neal Holtz) wrote:

> 
> I've halved the python time on my test by changing the entire inner loop
> to:
> 
>         pat = re.compile( "MALWARE:\s+(.*)" )
>         for line in lf:
>             mo = pat.search( line )
>             if mo:
>                 for vnam in mo.group(1).split( ", "):
>                     virstat[vnam] = virstat.get(vnam,0) + 1
>                     total += 1
>         lf.close()

A few random thoughts...

1) How often is mo true?  In other words, what percentage of the lines 
in the file match the pattern?  If it's very high or very low, that 
might give you some ideas where to look.  Running the profiler will help 
a lot too!
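
Here's a rough sketch of how you might answer question 1 and get a profile at the same time.  The sample lines and the virus names in them are made up by me, as are the function names; substitute your real log file.  I'm using the standard cProfile and pstats modules.

```python
import cProfile
import io
import pstats
import re

malware_pattern = re.compile(r"MALWARE:\s+(.*)")

def match_rate(lines):
    """Return (matching_lines, total_lines) for the MALWARE pattern."""
    matching_lines = 0
    total_lines = 0
    for line in lines:
        total_lines += 1
        if malware_pattern.search(line):
            matching_lines += 1
    return matching_lines, total_lines

def profile_report(lines, top_entries=5):
    """Run match_rate under the profiler and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    match_rate(lines)
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top_entries)
    return report.getvalue()

# Hypothetical sample data standing in for the real log file.
sample_lines = [
    "MALWARE: W32/Foo, W32/Bar\n",
    "clean message\n",
    "another clean message\n",
    "MALWARE:   W32/Foo\n",
]

matching_lines, total_lines = match_rate(sample_lines)
print("%d of %d lines match" % (matching_lines, total_lines))
```

Calling print(profile_report(sample_lines)) then shows where the time goes.  On a real log you'd want to profile your actual loop rather than my stand-in.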

2) It's probably a minor tweak, but you could factor out the 
"pat.search" name lookup cost by doing something like:

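pat = re.compile( r"MALWARE:\s+(.*)" )   # raw string, so \s reaches the regex engine intact
patSearch = pat.search                   # look up the bound method once, outside the loop
for line in lf:
    mo = patSearch( line )
    ....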
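
If you want to see whether this tweak is worth anything on your machine, something like the following would measure it with timeit.  The sample data and function names are my own invention, and I make no promise about how big the difference will be; it may well be lost in the noise.

```python
import re
import timeit

pattern = re.compile(r"MALWARE:\s+(.*)")

# Hypothetical sample data: half the lines match, half do not.
sample_lines = ["MALWARE: W32/Foo, W32/Bar\n", "clean message\n"] * 1000

def count_with_attribute_lookup(lines):
    """Look up pattern.search on every iteration."""
    hits = 0
    for line in lines:
        if pattern.search(line):
            hits += 1
    return hits

def count_with_hoisted_method(lines):
    """Look up pattern.search once, before the loop."""
    search = pattern.search
    hits = 0
    for line in lines:
        if search(line):
            hits += 1
    return hits

for function in (count_with_attribute_lookup, count_with_hoisted_method):
    seconds = timeit.timeit(lambda: function(sample_lines), number=20)
    print("%-30s %.4f seconds" % (function.__name__, seconds))
```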

3) There's a certain amount of duplication of effort going on between 
the regex search and the split; two passes over the same data.  Is it 
possible to write a regex which parses it all in one pass?  Of course, 
if the answer to question #1 is "very low percentage", then this is 
probably not the place to be looking.
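
I don't have a single regex to offer for this.  A different way to cut the duplicated work, which is my own substitution and not the one-regex answer, is to drop the regex and find the marker with str.partition.  This sketch assumes "MALWARE:" is always followed by whitespace and the names, and it strips trailing whitespace, so it is not an exact equivalent of your pattern.  The function name and sample names are hypothetical.

```python
def count_names_without_regex(lines, marker="MALWARE:"):
    """Count comma-separated names that follow the marker on each line."""
    name_counts = {}
    total = 0
    for line in lines:
        # partition returns (before, marker, after); the middle element
        # is empty when the marker does not occur in the line.
        _, found, tail = line.partition(marker)
        if not found:
            continue
        for name in tail.strip().split(", "):
            name_counts[name] = name_counts.get(name, 0) + 1
            total += 1
    return name_counts, total

# Hypothetical sample data standing in for the real log file.
sample_lines = [
    "MALWARE: W32/Foo, W32/Bar\n",
    "clean message\n",
    "MALWARE:   W32/Foo\n",
]
name_counts, total = count_names_without_regex(sample_lines)
print(name_counts, total)
```

Whether this beats the compiled regex is something only a timing run on your data can tell you.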

4) What does virstat.get do?

Lastly, a general programming style comment.  I'm a fan of longish 
variable names which describe what they're doing.  I had to think a bit 
to figure out that vnam probably stands for "virus name".  It would have 
been easier for me to figure out if you named the variable something 
like virusName (or even virus_name, if you're an underscore fan).  Same 
with the other variable names.
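
For what it's worth, here is roughly how your loop might read with longer names.  The names are just my guesses at what each variable means, and I've wrapped it in a function with made-up sample data so it runs on its own; the logic is unchanged from what you posted.

```python
import re

def count_virus_names(log_lines):
    """Return (virus_counts, total_detections) for the given log lines."""
    malware_pattern = re.compile(r"MALWARE:\s+(.*)")
    virus_counts = {}
    total_detections = 0
    for line in log_lines:
        match = malware_pattern.search(line)
        if match:
            for virus_name in match.group(1).split(", "):
                virus_counts[virus_name] = virus_counts.get(virus_name, 0) + 1
                total_detections += 1
    return virus_counts, total_detections

# Hypothetical sample data standing in for the real log file.
sample_log_lines = [
    "MALWARE: W32/Foo, W32/Bar\n",
    "clean message\n",
    "MALWARE:   W32/Foo\n",
]
virus_counts, total_detections = count_virus_names(sample_log_lines)
print(virus_counts, total_detections)
```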