string search function
Christopher T King
squirrel at WPI.EDU
Fri Jul 23 11:17:31 EDT 2004
On 23 Jul 2004, gyromagnetic wrote:
> Are there improvements that can be made to the code below? Are there
> better alternatives?
There are two ways I can think of:
1)
You can replace the 'or' search with a regular expression:
import re
def bsearch(fterms, stext, btype='AND'):
if btype == 'AND': # boolean 'and' search
...
else: # boolean 'or' search
found = bool(re.search(fterms.join('|'),stext))
return found
Assuming Python's regexps are somewhat optimized, this should be about the
fastest search you can get with arbitrary strings. Unfortunately, there's
no direct 'and' equivalent with regexps (it can be done, but it's helluva
ugly). If you want to optimize 'and', you'll need to use whatever
advanced algorithms our friends at Google use ;)
2)
Assuming your search terms don't contain whitespace, and you want to
compare only entire words (not substrings), replace the string search with
a list search, simply by adding the line 'stext = stext.split()' to the
top of the search function. This will greatly speed up the search by
avoiding checking partial words. The same effect can be achieved with the
regexp above by appending '\s' to the beginning and end of it.
Continuing along this line (whole words, no whitespace), you can replace
the lists with sets for an even greater speedup. The resulting code would
look like this (with a bit of code stolen from sets.py):
from sets import Set
def ispartialsubset(self, other):
"""Report whether another set contains at least one member of this set."""
self._binary_sanity_check(other)
for elt in ifilter(other._data.has_key, self):
return True
return False
Set.ispartialsubset = ispartialsubset
def bsearch(fterms, stext, btype='AND'):
stext = stext.split()
if btype == 'AND': # boolean 'and' search
found = Set(fterms).issubset(Set(stext))
else: # boolean 'or' search
found = Set(fterms).ispartialsubset(Set(stext))
return found
If the same stext or fterms is used multiple times, you could move the
.split() and/or the Set() transformations outside of the function, so they
only have to be done once.
Again, this method (using sets) only works if your words contain no
whitespace, and you're not interested in matching parts of words. It's
/much/ faster than a string search, though.
Hope this helps!
More information about the Python-list
mailing list