difflib get_close_matches improvement?

Peter Otten __peter__ at web.de
Mon Dec 21 12:53:23 EST 2009


Neal Becker wrote:

> difflib.get_close_matches looks useful.  But, I don't see where it defines
> 'close'.  Besides that, wouldn't it be much more useful if one could
> supply their own distance metric?

If you have a distance function you can find the N best matches with

>>> from heapq import nsmallest
>>> from functools import partial
>>> from Levenshtein import distance
>>> possibilities = ["ape", "apple", "peach", "puppy"]
>>> nsmallest(3, possibilities, key=partial(distance, "appel"))
['ape', 'apple', 'puppy']
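For comparison, the stock get_close_matches() ranks by SequenceMatcher similarity ratio rather than edit distance, and with its default cutoff of 0.6 it returns a slightly different result for the same data (this example is taken from the difflib docstring):

```python
import difflib

# Stock version: similarity ratio in [0.0, 1.0], default cutoff=0.6,
# so the poor matches "peach" and "puppy" fall below the threshold.
matches = difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
print(matches)  # ['apple', 'ape']
```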

With a cutoff it gets a bit messier...

>>> pairs = ((distance("appel", v), v) for v in possibilities)
>>> pairs = ((score, v) for score, v in pairs if score <= 2)
>>> [v for score, v in nsmallest(3, pairs)]
['ape', 'apple']

so you would want to wrap it in a function. But if you have a look into 
difflib.get_close_matches()...

def get_close_matches(word, possibilities, n=3, cutoff=0.6):
    if not n >  0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))

    # Move the best scorers to head of list
    result = heapq.nlargest(n, result)
    # Strip scores for the best n matches
    return [x for score, x in result]

there is a lot of stuff that only makes sense if you use a SequenceMatcher 
to calculate the similarity. For a generalized version you would probably 
have to throw out the range check for the cutoff and the optimizations.

I don't think it's worthwhile.
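For what it's worth, a generalized version along those lines might look like the sketch below. The names get_close_matches_generic and levenshtein are my own, not part of difflib; the pure-Python levenshtein is just a stand-in for the third-party Levenshtein.distance used earlier. Since an arbitrary metric has no natural [0.0, 1.0] range, the cutoff here is an optional absolute distance rather than a ratio:

```python
from heapq import nsmallest

def levenshtein(a, b):
    # Pure-Python stand-in for Levenshtein.distance: classic
    # dynamic-programming edit distance, one row at a time.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def get_close_matches_generic(word, possibilities, n=3, cutoff=None,
                              distance=levenshtein):
    # 'distance' is any callable returning a non-negative
    # dissimilarity; smaller means closer.  cutoff=None disables
    # filtering entirely.
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    pairs = ((distance(word, x), x) for x in possibilities)
    if cutoff is not None:
        pairs = ((score, x) for score, x in pairs if score <= cutoff)
    return [x for score, x in nsmallest(n, pairs)]
```

This reproduces both results from above: without a cutoff it returns ['ape', 'apple', 'puppy'], and with cutoff=2 it returns ['ape', 'apple'].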

Peter
