[Tutor] searching out keywords...

Kirby Urner urnerk@qwest.net
Fri, 08 Mar 2002 10:03:16 -0800


>
> From looking around a bit I have this...but:
>
>import sys, string
>
># open the file to read
>inp = open(sys.argv[1],"r")
>outp = open("badwords.txt","w")
>
># create a keyword list
>kw = ["sex","teen"]
>
># read the file in and search for keywords
>for kw in inp.readlines():
>   outp.write(line)   # needs to go to outp file
>print "finished text processing"
>
># close em up
>inp.close()
>outp.close()

As written, this won't do what you want.

    for kw in inp.readlines():
      outp.write(line)   # needs to go to outp file

is going to make kw take on each line of your 40MB
log file in turn, which overwrites the keyword list
you assigned to it earlier.  The variable 'line' is
never defined at all, so outp.write(line) will raise
a NameError.
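
To see the rebinding on its own, here is a tiny sketch.  The
list named 'lines' is a made-up stand-in for what
inp.readlines() would return; it is not from your script.

```python
kw = ["sex", "teen"]                       # the keyword list
lines = ["first line\n", "second line\n"]  # hypothetical stand-in for inp.readlines()

for kw in lines:   # rebinds kw to each line in turn
    pass

# kw no longer refers to the keyword list; it is the last line read.
print(repr(kw))
```

After the loop, kw is a plain string, so the keyword list is
gone for the rest of the program.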

Closer to what you want is:

    kw = ["sex","teen"]

    for line in inp.readlines():
       for badword in kw:  # loop through list
          if line.find(badword) > -1: # found!
             outp.write(line)
             break

This is a loop in a loop, with the outer loop progressing
through lines of text, the inner loop searching for each
element in kw, with a break as soon as it finds one --
no need to keep checking the same line once one offense
is counted.

As mentioned above, this algorithm is going to give false
positives on words like Middlesex and canteen.
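
If the false positives matter, one possible approach (not
what the loop above does) is to match whole words with the
re module.  This is only a sketch: the helper name
has_badword and the choice to ignore case are my own
assumptions, not something from your script.

```python
import re

kw = ["sex", "teen"]

# \b matches a word boundary, so "sex" does not match inside "Middlesex".
# re.escape protects against keywords that contain regex metacharacters.
# re.IGNORECASE is an assumption; drop it for case-sensitive matching.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, kw)) + r")\b",
                     re.IGNORECASE)

def has_badword(line):
    """Return True if line contains any keyword as a whole word (hypothetical helper)."""
    return pattern.search(line) is not None
```

The trade-off is that whole-word matching also skips
variants such as "teens", so you may need to list those
explicitly.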

But the more immediate challenge is to properly accommodate
40MB files.  inp.readlines() will bring all 40MB into
RAM and/or virtual memory, which may or may not be a
problem.  Let's assume that it is.

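One way around this is to provide a sizehint as your
argument to readlines(), which will limit the size of
what's brought in.  The sizehint is an approximate byte
count: readlines() returns whole lines totalling roughly
that much.  Rewriting to include this: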

    kw = ["sex","teen"]

    while 1:
       chunk = inp.readlines(1024)  # get some lines
       if len(chunk)==0:            # quit if no more
           break
       for line in chunk:           # do the loop-de-loop
          for badword in kw:        # loop through list
             if line.find(badword) > -1: # found!
                outp.write(line)
                break
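
Another option, assuming Python 2.2 or later, is to loop
over the file object itself, which hands back one line at a
time without building a big list.  A minimal sketch, written
for current Python; the function name filter_lines and the
in-memory test data are made up for illustration:

```python
import io

def filter_lines(inp, outp, keywords):
    """Write each line of inp that contains any keyword to outp; return the hit count."""
    hits = 0
    for line in inp:                      # the file object yields one line at a time
        for badword in keywords:          # loop through list
            if line.find(badword) > -1:   # found!
                outp.write(line)
                hits += 1
                break                     # one hit per line is enough
    return hits

# Hypothetical in-memory stand-ins for the log file and badwords.txt.
log = io.StringIO("hello\nteen page\nclean\n")
out = io.StringIO()
filter_lines(log, out, ["sex", "teen"])
```

With real files you would pass the objects returned by
open() in place of the StringIO stand-ins.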

Probably you'll want to add some feedback to the user
that indicates how far the analysis has progressed.
You could use a counter and trigger a print every
time you hit a multiple of 100 lines or something.
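
The counter idea might look something like this.  It is a
sketch only: scan_with_progress, the 'every' parameter and
the 'report' stream are names I made up, and 100 is just
the example interval from above.

```python
import io
import sys

def scan_with_progress(inp, outp, keywords, every=100, report=sys.stderr):
    """Filter lines as before, writing a progress note every `every` lines."""
    count = 0
    for line in inp:
        count += 1
        if count % every == 0:            # multiple of `every`: tell the user
            report.write("%d lines processed\n" % count)
        for badword in keywords:
            if line.find(badword) > -1:   # found!
                outp.write(line)
                break
    return count                          # total number of lines read

# Hypothetical in-memory demo: 250 harmless lines, so two progress notes.
demo_in = io.StringIO("nothing here\n" * 250)
demo_out = io.StringIO()
total = scan_with_progress(demo_in, demo_out, ["sex", "teen"])
```

Progress goes to stderr by default here so that it does not
mix with anything you print to stdout; for a 40MB log you
would likely pick a much larger interval.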

Kirby