How fuzzy is get_close_matches() in difflib?

Fri Nov 17 03:19:22 EST 2006

On Thu, 16 Nov 2006 22:59:41 -0800, John Henry wrote:

> I encountered a case where I am trying to match "HIDESST1" and
> "HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]
> 
> Well, they both hit "HIDEDST1" as the first match which is not exactly
> the result I was looking for.  I don't understand why "HIDESCT1" would
> not hit "HIDEDCT1" as a first choice.

All those strings are so similar that I think the exact order of "best
match" will depend on the precise details of the algorithm. I note that
the difflib specifically says:

[quote]
This does not yield minimal edit sequences, but does tend to yield matches
that "look right" to people. 
[end quote]

There are many ways that one can calculate the difference between two
strings, e.g. is the string "acab" an "ab" with "ac" inserted at the
front, or "ab" with "ca" inserted in the middle? The Unix diff utility
may return minimal diffs, but difflib does not make that claim.

You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:

>>> "".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
'  H  I  D  E- D+ S  C  T  1'
>>> "".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
'  H  I  D  E- D  S+ C  T  1'

I honestly don't know how to interpret those results :(

>>> s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1")
>>> t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")
>>>
>>> for block in s.get_matching_blocks():
...     print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>>>
>>> for block in t.get_matching_blocks():
...     print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>>

I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)

-- 
Steven D'Aprano