Most efficient method to search text?
Robin Siebler
robin.siebler at corp.palm.com
Tue Oct 15 20:35:29 EDT 2002
I wrote a script to search a slew of files for certain words and
names. However, I am sure that there has to be a faster/better way to
do it. Here is what I am doing:
1. Load words to exclude into a list.
2. Load names to exclude into a list.
3. Load words to include into a list.
4. Remove any duplicates from the name list.
5. Generate a list of files to search.
6. Open the 1st file.
7. Search each line:
   a. For a word (line.find(word)). If I get a hit, I then use an RE
      to perform a more exact search (the pattern I am using is
      '\w+word|word\w+|word').
      i. Compare any matches against the include/exclude list. If
         it is a match, keep searching. Otherwise, log the line.
   b. For a name (line.find(name)). If I get a hit, I then use an RE
      to perform a more exact search (the pattern that I am using is
      '\bname\b'). If I get a hit, log the line.
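The two-stage check in step 7 can be sketched in isolation as follows. This is only a minimal illustration, not the script itself: the function names find_words and find_names are hypothetical, and re.escape is an added assumption to keep search terms containing regex metacharacters from being misread as pattern syntax.

```python
import re

def find_words(line, word):
    """Step 7a: return every token in `line` that contains `word`."""
    # Stage 1: cheap case-insensitive substring test.
    if line.lower().find(word.lower()) == -1:
        return []
    # Stage 2: the '\w+word|word\w+|word' pattern from the list above.
    w = re.escape(word)
    pattern = re.compile(r'\w+' + w + '|' + w + r'\w+|' + w, re.IGNORECASE)
    return pattern.findall(line)

def find_names(line, names):
    """Step 7b: return the names that occur in `line` as whole words."""
    hits = []
    lowered = line.lower()
    for name in names:
        # Stage 1: skip the regex entirely when the substring is absent.
        if lowered.find(name.lower()) == -1:
            continue
        # Stage 2: confirm the hit with a word-boundary pattern.
        pattern = re.compile(r'\b' + re.escape(name) + r'\b', re.IGNORECASE)
        if pattern.search(line):
            hits.append(name)
    return hits
```

Here find_names("Ask Robin about the robinson file", ["Robin", "Rob"]) keeps only "Robin": the substring "rob" is present in the line, but it never appears as a whole word, so the second stage rejects it.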
The reason that I first search using line.find() is that in the past I
have done some searches for simple strings and found that line.find()
was much faster than an RE, so I am only using an RE when I need to.
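One way to check that observation on a given machine is a small timing sketch like the one below. The sample line, the search string and the repetition count are all made up for illustration, and the result will vary with the interpreter version and the text being searched, so it should be read as a measurement recipe, not as proof either way.

```python
import re
import timeit

# Hypothetical sample data: one long line with the target at the very end.
line = "The quick brown fox jumps over the lazy dog " * 20 + "needle\n"
needle = "needle"
compiled = re.compile(needle)  # compile once, outside the timed call

def with_find():
    # Plain substring search.
    return line.find(needle) != -1

def with_regex():
    # Same search through a precompiled regular expression.
    return compiled.search(line) is not None

find_time = timeit.timeit(with_find, number=20000)
regex_time = timeit.timeit(with_regex, number=20000)
print("str.find:  %.4f s" % find_time)
print("re.search: %.4f s" % regex_time)
```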
I am including my code below (unfortunately, Google screws the
formatting up). Any suggestions to improve it would be appreciated.
import re

LinesFound = []; LinesFound = FunkyList(LinesFound); msg = ""
LineNum = 0; Header = 0
LogFile = var['LogFile']

print '\nGenerating file list...'  # Let user see that script is running
FilesToSearch = listFiles(var['SearchPath'], var['SearchExt'],
                          var['Recurse'])
if len(FilesToSearch) == 0:
    print 'No Files Found!'
    clean_up(var)
else:
    print 'Number of files to search: ' + str(len(FilesToSearch))
    print 'Processing files...',  # Let user see that script is running
    while FilesToSearch:
        FileBeingSearched = FilesToSearch.pop()  # Get/remove last file name
        open_file = open(FileBeingSearched)
        print "\nProcessing " + FileBeingSearched,
        for line in open_file.xreadlines():
            LineNum += 1
            # Let user see that script is running
            if LineNum > 24 and LineNum % 100 == 0: print ".",
            # Search line for proscribed words
            for word in var['ExcludeWords']:
                # Perform a case insensitive search for word *anywhere*
                # in the line
                if line.lower().find(word.lower()) != -1:
                    pattern = r'\w+word|word\w+|word'
                    pattern = pattern.replace('word', word.lower())
                    s_word = re.compile(pattern, re.IGNORECASE)
                    # If the phrase was found, get a list containing
                    # the matches
                    match_found = unique(s_word.findall(line))
                    for match in match_found:
                        # If the word contains an underscore
                        if match.find('_') != -1:
                            words = r'\w+'
                            w_find = re.compile(words, re.IGNORECASE)
                            words = ''
                            for item in w_find.findall(line):
                                if item.find('_') != -1:
                                    words = words + ' ' + \
                                            str(item.split('_'))
                                else:
                                    words = words + ' ' + item
                            m_found = unique(s_word.findall(words))
                            for item in m_found:
                                if item in var['ExcludeWords'] and \
                                   item not in var['IncludeWords']:
                                    msg = '\tLine ' + str(LineNum) + \
                                          ': The word "' + word + \
                                          '" was found in: "' + \
                                          line.strip() + '"'
                                    LinesFound.append(msg)
                                    break
                        # Is the word in IncludeWords?
                        elif match not in var['IncludeWords']:
                            msg = '\tLine ' + str(LineNum) + \
                                  ': The word "' + word + \
                                  '" was found in: "' + line.strip() + '"'
                            LinesFound.append(msg)
                            break
            # Search line for names
            for name in var['Names']:
                # Perform a case insensitive search
                if line.lower().find(name.lower()) != -1:
                    # Raw string, so that \b is a word boundary and not
                    # a backspace character
                    pattern = r'\bname\b'
                    pattern = pattern.replace('name', name)
                    s_word = re.compile(pattern, re.IGNORECASE)
                    # If the phrase was found, get a list containing
                    # the matches
                    match_found = unique(s_word.findall(line))
                    for match in match_found:
                        if match in var['Names']:
                            msg = '\tLine ' + str(LineNum) + \
                                  ': The name "' + name + \
                                  '" was found in: "' + line.strip() + '"'
                            LinesFound.append(msg)
                            break
            if len(LinesFound) > 0:
                if not Header:
                    LogFile.write('Proscribed words were found in ' +
                                  FileBeingSearched + '\n')
                    LogFile.write('\n')
                    Header = 1
                for line in LinesFound:
                    LogFile.write(line + '\n')
                LogFile.write('\n')
                LinesFound = []
        open_file.close()
        LineNum = 0; Header = 0; hit = 0
    print '\nProcessing Complete.'
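
The unique() helper called in the code is not shown in this post. A minimal sketch of what such a helper might look like, assuming it only needs to drop repeated matches while keeping them in first-seen order (the real helper may behave differently):

```python
def unique(items):
    """Return `items` with duplicates removed, keeping first-seen order."""
    seen = set()    # values already emitted
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

Membership tests against a set take roughly constant time, so this stays cheap even when findall() returns many repeated matches for a line.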