speed problems

Thu Jun 3 12:11:38 EDT 2004

Steve Lamb <grey at despair.dmiyu.org> wrote in
news:slrncbuebm.i05.grey at dmiyu.org: 

> Python:
>>     for line in lf.readlines():
>>       if string.count( line, "INFECTED" ):
>>         vname = re.compile( "INFECTED \((.*)\)" ).search( line
>>         ).group(1) 
> 
>     If I read this correctly you're compiling this regex every time
> you're going through the for loop.  So every line the regex is
> compiled again.  You might want to compile the regex outside the loop
> and only use the compiled version inside the loop.
>
>     I *think* that Perl caches compiled regexs which is why they don't
> have two different ways of calling the regex while Python, in giving
> two different calls to the regex, will compile it every time if you
> expressedly call for a compile.  Again, just a guess based on how I
> presume the languages work and how I'd write them differently.

No, Python caches compiled regexes, so you won't see much speed
difference unless you use enough different regexes to overflow the
cache. Pulling the compile out of the loop is still a good idea on
general principles, though.
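
A minimal sketch of the two styles, to make the comparison concrete. The
sample lines and the function names are made up for illustration and are
not from the original script. Both versions return the same result; the
second one just avoids going back to the re module's cache on every line.

```python
import re

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "scan: INFECTED (Worm.A)",
    "scan: clean",
    "scan: INFECTED (Worm.B, Trojan.C)",
]


def names_compile_in_loop(lines):
    """Call re.compile() for every line; the re module's cache absorbs the cost."""
    found = []
    for line in lines:
        m = re.compile(r"INFECTED \((.*)\)").search(line)
        if m:
            found.append(m.group(1))
    return found


# Compiled once, outside the loop.
INFECTED_RE = re.compile(r"INFECTED \((.*)\)")


def names_compile_once(lines):
    """Reuse the precompiled pattern inside the loop."""
    found = []
    for line in lines:
        m = INFECTED_RE.search(line)
        if m:
            found.append(m.group(1))
    return found


print(names_compile_in_loop(SAMPLE))
print(names_compile_once(SAMPLE))
```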

The code you quoted does have one place to optimise: calling readlines,
especially on a large file, will be *much* slower than just iterating
over the file object directly, because readlines has to read the whole
file into a list before the loop can start.

i.e. use

    for line in lf:
        ... whatever ...
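
A small sketch of the difference, using io.StringIO as a stand-in for the
open log file (the sample contents are invented). It only shows that both
loops see the same lines; it does not measure speed.

```python
import io

# Stand-in for an open log file; io.StringIO iterates line by line like a real file.
lf = io.StringIO("ok\nINFECTED (Worm.A)\nok\nINFECTED (Worm.B)\n")

# readlines() builds a list holding every line before the loop starts.
all_lines = lf.readlines()
hits_readlines = sum(1 for line in all_lines if "INFECTED" in line)

# Iterating over the file object yields one line at a time instead.
lf.seek(0)
hits_iter = 0
for line in lf:
    if "INFECTED" in line:
        hits_iter += 1

print(hits_readlines, hits_iter)
```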

Some other things that could be improved (although I suspect the real 
problem was calling readlines):

The original code posted uses functions from the string module. Using 
string methods instead ought to be faster, e.g. line.count("INFECTED") 
instead of string.count(line, "INFECTED").

Use 
    if logfile.endswith('.gz'):
instead of:
    if logfile[-3:] == '.gz':

Use:
    if "INFECTED" in line:
instead of calling line.count
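
As a quick check that these substitutions are equivalent, here is a
sketch with a made-up file name and log line (both hypothetical):

```python
logfile = "mail.log.gz"            # hypothetical file name
line = "scan: INFECTED (Worm.A)"   # hypothetical log line

# Slice comparison versus the endswith() string method.
is_gzip_slice = logfile[-3:] == ".gz"
is_gzip_method = logfile.endswith(".gz")

# Counting occurrences versus a plain membership test.
has_count = bool(line.count("INFECTED"))
has_in = "INFECTED" in line

print(is_gzip_slice, is_gzip_method, has_count, has_in)
```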

I don't understand why the inner loop needs two cases, one for vname 
containing a ',' and one where it doesn't. It looks to me as though the 
code could just split whether or not there is a comma. If there isn't 
one, split just returns a one-element list containing the original string.
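
A short sketch of that behaviour, with invented virus names standing in
for what the regex would capture. The same loop handles both the
single-name and the comma-separated case:

```python
virstat = {}
total = 0

# Hypothetical captured names: one single name, one comma-separated pair.
for vname in ["Worm.A", "Worm.B, Trojan.C", "Worm.A"]:
    # With no ", " present, split() returns a list holding the original string.
    for vnam in vname.split(", "):
        if vnam not in virstat:
            virstat[vnam] = 1
        else:
            virstat[vnam] += 1
        total += 1

print(virstat, total)
```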

Untested revised code:

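    import re

    # Compile once, outside the loop; the raw string keeps the
    # backslashes literal.
    INFECTEDPAT = re.compile(r"INFECTED \((.*)\)")
    for line in lf:
        if "INFECTED" in line:
            vname = INFECTEDPAT.search(line).group(1)
            # split() gives a one-element list when there is no ", "
            for vnam in vname.split(", "):
                if vnam not in virstat:
                    virstat[vnam] = 1
                else:
                    virstat[vnam] += 1
                total += 1
    lf.close()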