difflib get_close_matches improvement?
Peter Otten
__peter__ at web.de
Mon Dec 21 12:53:23 EST 2009
Neal Becker wrote:
> difflib.get_close_matches looks useful. But, I don't see where it defines
> 'close'. Besides that, wouldn't it be much more useful if one could
> supply their own distance metric?
If you have a distance function you can find the N best matches with
>>> from heapq import nsmallest
>>> from functools import partial
>>> from Levenshtein import distance
>>> possibilities = ["ape", "apple", "peach", "puppy"]
>>> nsmallest(3, possibilities, key=partial(distance, "appel"))
['ape', 'apple', 'puppy']
With a cutoff it gets a bit messier...
>>> pairs = ((distance("appel", v), v) for v in possibilities)
>>> pairs = ((score, v) for score, v in pairs if score <= 2)
>>> [v for score, v in nsmallest(3, pairs)]
['ape', 'apple']
so you would want to wrap it in a function, but if you have a look into
difflib.get_close_matches()...
def get_close_matches(word, possibilities, n=3, cutoff=0.6):
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))

    # Move the best scorers to head of list
    result = heapq.nlargest(n, result)
    # Strip scores for the best n matches
    return [x for score, x in result]
there is a lot of stuff that only makes sense if you use a SequenceMatcher
to calculate the similarity. For a generalized version you would probably
have to throw out the range check for the cutoff and the optimizations.
I don't think it's worthwhile.
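Still, for anyone who wants it, a generalized version along the lines sketched
above might look like this. This is only a sketch: the function name is made up,
the pure-Python levenshtein() is a hypothetical stand-in for the C-accelerated
Levenshtein.distance used earlier, and note that cutoff now means a *maximum
distance* rather than a minimum similarity ratio:

```python
from heapq import nsmallest

def levenshtein(a, b):
    # Plain dynamic-programming edit distance (stand-in for
    # Levenshtein.distance); smaller means closer.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def get_close_matches_by_distance(word, possibilities, distance,
                                  n=3, cutoff=2):
    # Like difflib.get_close_matches(), but with a user-supplied
    # distance function instead of a SequenceMatcher ratio.
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    pairs = ((distance(word, x), x) for x in possibilities)
    within = ((score, x) for score, x in pairs if score <= cutoff)
    return [x for score, x in nsmallest(n, within)]
```

>>> get_close_matches_by_distance("appel", ["ape", "apple", "peach", "puppy"], levenshtein)
['ape', 'apple']

Ties are broken alphabetically because nsmallest() compares the (score, word)
tuples; plug in a tie-breaking key if that matters.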
Peter