boolean searching with keywords
charlotte at henkle.com
Wed Sep 18 00:23:09 CEST 2002
I'm trying to add some new functionality to a program I've written in
Python. This is the first program I've ever written in Python, so
please forgive my newbieness. I wouldn't be at all shocked if I
haven't picked the easiest solution to various problems. ;)
This is how the program currently works. The first thing the program
does when it starts is open a POP connection to see if there's any
mail in its mailbox. If there's mail, it pulls down the first
message. It parses the message looking for URLs, pulls them out, and
appends them to a URL list. Once it has a URL list, it begins to
follow the URLs so that it can pull down the HTML.
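For illustration, the URL-extraction step described above could be sketched roughly as follows. This is not the code from the program itself: the function name extract_urls and the URL pattern are assumptions made for the example, and the pattern only handles plain http/https links found in the message text.

```python
import re

# Assumed pattern: an http/https URL runs until whitespace, a quote,
# or an angle bracket. Real mail may need something more careful.
URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')

def extract_urls(message_text):
    """Return the URLs found in message_text, in order, without duplicates."""
    urls = []
    for url in URL_PATTERN.findall(message_text):
        # Drop punctuation that belongs to the sentence, not the URL.
        url = url.rstrip('.,;:!?)')
        if url not in urls:
            urls.append(url)
    return urls
```

The returned list plays the role of the URL list that the program then follows.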
The program also has a list of keywords. The keywords text file is a
CR-delimited list of words to look for in the HTML. If the program
finds ANY of the keywords, it sends off an email with a copy of the
HTML it found. Here are the important functions:
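    import re

    # Read the keyword file, one keyword per line.
    f = open('Keywords_Business.txt', 'r')
    keys = [s.rstrip('\n') for s in f.readlines()]

    for eachLine in urlList:
        print("Following this URL: " + eachLine)
        # ... fetch the page (not shown); d is the HTML being searched ...
        # Match any one of the keywords as a whole word.
        regexp = r'\b(' + '|'.join(keys) + r')\b'
        newsearch = re.compile(regexp)
        found = newsearch.search(d)
        if found:
            subject = found.group()
            # ... send the mail (not shown) ...
        else:
            return 0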
So far, so good (although if you can see improvements in this, please
let me know).
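One possible tightening of the keyword pattern, offered as a sketch and not as the one right answer: escape each keyword so that characters such as '.' are taken literally, skip blank lines (an empty alternative would match at any word boundary, so every page would look like a hit), and ignore case. The function name build_keyword_pattern is made up for this example, and the case-insensitive matching is an assumption about what is wanted.

```python
import re

def build_keyword_pattern(lines):
    """Compile one whole-word, case-insensitive pattern from keyword lines."""
    # Strip surrounding whitespace and skip blank lines.
    keywords = [line.strip() for line in lines if line.strip()]
    if not keywords:
        return None  # nothing to search for
    # re.escape keeps regex metacharacters in a keyword from being interpreted.
    alternatives = '|'.join(re.escape(k) for k in keywords)
    return re.compile(r'\b(' + alternatives + r')\b', re.IGNORECASE)
```

Called as build_keyword_pattern(open('Keywords_Business.txt')), it returns a compiled pattern whose search() method can be used on the HTML just as before.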
This works fine for single words on a line in a file. However, now I
want to add a Boolean search capability. That is, instead of searching
for "Frogs" or "Green" or "Lizards", I'd like to be able to search for
"frogs NOT green" or "Lizards AND green" or "green AND scaly". As
before, if any of the phrases is found, I want to break out of the
search and send mail.
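A minimal sketch of one way to handle phrases like these without a tree, under some loudly stated assumptions: each line of the keyword file is one phrase; AND and NOT are written in upper case; a word after NOT must be absent from the HTML; every other word must be present; there is no OR inside a phrase, because separate lines already act as alternatives. All function names here are hypothetical.

```python
import re

def parse_phrase(phrase):
    """Split 'a AND b NOT c' into (required, forbidden) word lists."""
    required, forbidden = [], []
    target = required  # words before any operator are required
    for token in phrase.split():
        if token == 'AND':
            target = required
        elif token == 'NOT':
            target = forbidden
        else:
            target.append(token)
    return required, forbidden

def contains_word(word, text):
    """True if word occurs in text as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text, re.IGNORECASE) is not None

def phrase_matches(phrase, text):
    """True if all required words are present and no forbidden word is."""
    required, forbidden = parse_phrase(phrase)
    if not required and not forbidden:
        return False  # a blank line should never match
    return (all(contains_word(w, text) for w in required)
            and not any(contains_word(w, text) for w in forbidden))

def first_matching_phrase(phrases, text):
    """Return the first phrase that matches text, or None (stops early)."""
    for phrase in phrases:
        if phrase_matches(phrase, text):
            return phrase
    return None
```

With this shape, first_matching_phrase(keys, d) would take the place of the single regular-expression search, and a result other than None means the mail should be sent. Nested expressions with parentheses would need a real parser; this sketch deliberately does not attempt that.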
I'm not sure how to do this. I considered making a tree structure,
and then walking the tree, but I was unable to get it working
correctly. Additionally, I was unsure if this was making the problem
too complex: Is there an easier way to get the functionality I want?
I would appreciate any help.