[Tutor] searching out keywords...

Javier JJ python.tutorial@jarava.org
Fri, 8 Mar 2002 18:43:53 +0100


First of all, I'm far from being an expert in ... well, anything, but much
less in what I'm going to say :) So take my words with a huge grain of
salt...

A few comments:



> I have to monitor some hefty (min 40MB) log files. I have a list of
> keywords that I want to search for and then write those lines out to a
> different file. Basically I want to do this.
>
> 1. run the py file with the log file as an ARG on the command line.
> 2. have it search that file for keywords
> 3. insert the lines that the keywords appears on into a different file
>
> From looking around a bit I have this...but:
>
> import sys, string
>
> # open the file to read
> inp = open(sys.argv[1],"r")
> outp = open("badwords.txt","w")
>
> # create a keyword list
> kw = ["sex","teen"]
>
> # read the file in and search for keywords
> for kw in inp.readlines():
>   outp.write(line)   # needs to go to outp file
> print "finished text processing"

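This won't work. When you write "for kw in inp.readlines()", each line of
the log file gets assigned to the "kw" variable in turn, so your keyword
list is thrown away. Also, "line" is never defined anywhere, so the
outp.write(line) call fails.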
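
To see the rebinding for yourself, here is a tiny sketch. The two sample
lines are made up and just stand in for what inp.readlines() would return:

```python
# Hypothetical sample data standing in for inp.readlines()
lines = ["first line\n", "second line\n"]

kw = ["sex", "teen"]      # the keyword list
for kw in lines:          # the loop variable reuses the name kw...
    pass

# ...so kw now holds the last line read, and the keyword list is gone
print(repr(kw))
```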

This is a (rough, probably very wrong and simple) way to do it that
works....

>>> for i in inp.readlines():    # read in the whole log file, line by line
...     for j in kw:             # for each of the keywords
...         if i.find(j) != -1:  # the keyword is part of the line; this also
...                              # finds it when it's part of a longer word
...             outp.write(i)    # or print it, or whatever
...             break            # don't write the line twice for a 2nd keyword

It'd probably be a good idea to compare j.lower() against i.lower() to
avoid a "caps mismatch" ...
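
A minimal sketch of that case-insensitive comparison, written in current
Python syntax. The function name find_matches and the sample lines are made
up for illustration; only the lower() / find() idea comes from the loop
above:

```python
def find_matches(lines, keywords):
    """Return the lines that contain any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]   # lower-case the keywords once
    matches = []
    for line in lines:
        text = line.lower()                   # lower-cased copy, for comparing only
        for k in lowered:
            if text.find(k) != -1:            # same substring test as above
                matches.append(line)          # keep the original line untouched
                break                         # one hit is enough
    return matches

# Hypothetical sample data
sample = ["Nothing here\n", "TEEN chat room\n", "Essex county\n"]
print(find_matches(sample, ["sex", "teen"]))
```

Note that "Essex county" is reported too, because a substring test also
matches a keyword inside a longer word.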

> # close em up
> inp.close()
> outp.close()
>
> Help would be appreciated...I am learning wxPython as well and will put a
> gui on it once this is done.
>
> Bob

A couple of notes on performance:

inp.readlines() reads the whole file into memory; if the log is big, that
is quite a performance hit. So it'd probably be better to read the file one
line at a time, with a readline() loop...
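
Here is one way the line-at-a-time version might look, in current Python
syntax. Take it as a sketch: filter_log is a made-up name, and it iterates
over the file object instead of calling readline() by hand, which has the
same effect of holding only one line in memory at a time.

```python
def filter_log(in_path, out_path, keywords):
    """Copy lines containing any keyword to out_path; return how many were written."""
    lowered = [k.lower() for k in keywords]
    count = 0
    with open(in_path, "r") as inp, open(out_path, "w") as outp:
        for line in inp:                  # reads one line at a time
            text = line.lower()           # compare case-insensitively
            for k in lowered:
                if k in text:             # substring test, like find() != -1
                    outp.write(line)
                    count += 1
                    break                 # don't write the same line twice
    return count
```

From the original script it could be called as
filter_log(sys.argv[1], "badwords.txt", ["sex", "teen"]).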

So, if performance is important, I'd say you'd be better off using something
as simple as (on a command line)

 cat log_file | grep keyword > badwords_file

This is much faster :-)

But now I'll crawl back under the stone and let the real experts speak :-)

    Javier
----

Windows N'T: as in Wouldn't, Couldn't, and Didn't.