roy at panix.com
Wed Jun 9 18:16:26 CEST 2004
In article <639f17f8.0406090728.4e4e0bc6 at posting.google.com>,
nholtz at docuweb.ca (Neal Holtz) wrote:
> I've halved the python time on my test by changing the entire inner loop
>     pat = re.compile( "MALWARE:\s+(.*)" )
>     for line in lf:
>         mo = pat.search( line )
>         if mo:
>             for vnam in mo.group(1).split( ", "):
>                 virstat[vnam] = virstat.get(vnam,0) + 1
>                 total += 1
A few random thoughts...
1) How often is mo true? In other words, what percentage of the lines
in the file match the pattern? If it's very high or very low, that
might give you some ideas where to look. Running the profiler will help
a lot too!
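Something along these lines would answer both questions. It is only a rough sketch: SAMPLE_LINES is a made-up stand-in for your log file, and match_rate, count_viruses and profile_report are names I invented for illustration. The loop inside count_viruses is the one from your post.

```python
import cProfile
import io
import pstats
import re

# Hypothetical sample standing in for the real log file.
SAMPLE_LINES = [
    "MALWARE: W32/Foo, W32/Bar\n",
    "nothing to see here\n",
    "MALWARE: W32/Foo\n",
    "another clean line\n",
]

pat = re.compile(r"MALWARE:\s+(.*)")

def match_rate(lines):
    """Return the fraction of lines on which the pattern matches."""
    matched = 0
    seen = 0
    for line in lines:
        seen += 1
        if pat.search(line):
            matched += 1
    return matched / seen if seen else 0.0

def count_viruses(lines):
    """The loop from the quoted post, wrapped in a function for profiling."""
    virstat = {}
    total = 0
    for line in lines:
        mo = pat.search(line)
        if mo:
            for vnam in mo.group(1).split(", "):
                virstat[vnam] = virstat.get(vnam, 0) + 1
                total += 1
    return virstat, total

def profile_report(lines):
    """Run count_viruses under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(count_viruses, lines)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()
```

Calling match_rate on the real file gives the percentage asked about, and printing profile_report shows where the time goes (regex search, split, or the dictionary update).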
2) It's probably a minor tweak, but you could factor out the
"pat.search" name lookup cost by doing something like:
    pat = re.compile( "MALWARE:\s+(.*)" )
    patSearch = pat.search
    for line in lf:
        mo = patSearch (line)
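Whether this is worth doing is easy to measure with timeit. The sketch below is only an illustration: the input lines are invented, and with_lookup, with_hoisted and compare are hypothetical names. Expect the difference to be small, and to vary between Python versions.

```python
import re
import timeit

pat = re.compile(r"MALWARE:\s+(.*)")

# Hypothetical input: mostly non-matching lines plus one match.
lines = ["clean line %d\n" % i for i in range(1000)] + ["MALWARE: W32/Foo\n"]

def with_lookup():
    """Look up pat.search on every iteration."""
    hits = 0
    for line in lines:
        if pat.search(line):
            hits += 1
    return hits

def with_hoisted():
    """Bind pat.search to a local name once, outside the loop."""
    patSearch = pat.search
    hits = 0
    for line in lines:
        if patSearch(line):
            hits += 1
    return hits

def compare(repeats=20):
    """Return (seconds with lookup, seconds with hoisted name)."""
    return (
        timeit.timeit(with_lookup, number=repeats),
        timeit.timeit(with_hoisted, number=repeats),
    )
```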
3) There's a certain amount of duplication of effort going on between
the regex search and the split; two passes over the same data. Is it
possible to write a regex which parses it all in one pass? Of course,
if the answer to question #1 is "very low percentage", then this is
probably not the place to be looking.
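One caveat: Python's re module keeps only the last capture of a repeated group, so a single pattern cannot hand back every name directly. A rough approximation, which I have not timed against your data, is to locate the marker with a plain str.find and then let finditer pull the names out of the rest of the line. The names marker, name_pat and virus_names are made up for this sketch, and it assumes names are separated by commas and contain no commas themselves, which is not quite the same as splitting on ", ".

```python
import re

marker = "MALWARE:"
# A name starts at a character that is neither comma nor whitespace
# and runs up to the next comma or the end of the line.
name_pat = re.compile(r"[^,\s][^,\n]*")

def virus_names(line):
    """Return the names listed after the marker, or [] if it is absent."""
    start = line.find(marker)
    if start < 0:
        return []
    tail_start = start + len(marker)
    # finditer accepts a start position, so no substring copy is made.
    return [m.group().rstrip() for m in name_pat.finditer(line, tail_start)]

def count_names(lines):
    """Tally names across lines, mirroring the loop in the quoted post."""
    virstat = {}
    total = 0
    for line in lines:
        for name in virus_names(line):
            virstat[name] = virstat.get(name, 0) + 1
            total += 1
    return virstat, total
```

Unlike the original pattern, this does not insist on whitespace after the colon, so check it against real log lines before trusting the counts.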
4) What does virstat.get do?
Lastly, a general programming style comment. I'm a fan of longish
variable names which describe what they're doing. I had to think a bit
to figure out that vnam probably stands for "virus name". It would have
been easier for me to figure out if you named the variable something
like virusName (or even virus_name, if you're an underscore fan). Same
with the other variable names.