How fuzzy is get_close_matches() in difflib?

John Machin sjmachin at lexicon.net
Fri Nov 17 16:45:30 EST 2006



On Nov 17, 7:19 pm, Steven D'Aprano <s... at REMOVEME.cybersource.com.au>
wrote:
[snip]

> You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":
>
> HIDEDCT1 -- John's "best match" target string
> HIDEDST1 -- difflib's "best match" target string
> HIDESCT1 -- source string
>
> John's best match matches in seven of eight positions, compared to six
> of eight for the difflib best match. Disregarding order, both have seven
> matching characters. That's a pretty slim difference between the two:
>
> >>> "".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
> '  H  I  D  E- D+ S  C  T  1'
> >>> "".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
> '  H  I  D  E- D  S+ C  T  1'
>
> I honestly don't know how to interpret those results :(
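
For anyone wanting to check the counts quoted above, here is a rough
sketch that compares the two candidates with the source string both
position by position and via SequenceMatcher.ratio(). The helper names
positional_matches and similarity are made up for illustration; they are
not part of difflib.

```python
import difflib

source = "HIDESCT1"
candidates = ["HIDEDCT1", "HIDEDST1"]


def positional_matches(a, b):
    """Count the positions at which a and b hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def similarity(a, b):
    """SequenceMatcher ratio: 2 * matched characters / total characters."""
    return difflib.SequenceMatcher(None, a, b).ratio()


for cand in candidates:
    print(cand, positional_matches(cand, source), similarity(cand, source))
```

The positional counts differ (seven versus six), but the ratio comes out
the same for both candidates, because SequenceMatcher finds seven matching
characters in each case.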

Take 1, firing from the hip:

The formatting of that output is suboptimal. Better would be:
'  H  I  D  E -D +S  C  T  1'
interpreted as "delete D, insert S"
'  H  I  D  E -D  S +C  T  1'
interpreted as "delete D, insert C"

On reflection, the author of difflib is known not to be so silly, so
take 2:

| >>> list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
['  H', '  I', '  D', '  E', '- D', '+ S', '  C', '  T', '  1']
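
The same edit steps can be read more directly from
SequenceMatcher.get_opcodes(). A minimal sketch follows; describe_edits is
a hypothetical helper name, not something difflib provides.

```python
import difflib


def describe_edits(a, b):
    """Return a readable list of the steps that turn a into b."""
    matcher = difflib.SequenceMatcher(None, a, b)
    steps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            steps.append(f"keep {a[i1:i2]!r}")
        elif tag == "delete":
            steps.append(f"delete {a[i1:i2]!r}")
        elif tag == "insert":
            steps.append(f"insert {b[j1:j2]!r}")
        else:  # tag == "replace"
            steps.append(f"replace {a[i1:i2]!r} with {b[j1:j2]!r}")
    return steps


for target in ("HIDEDCT1", "HIDEDST1"):
    print(target, describe_edits(target, "HIDESCT1"))
```

Note that get_opcodes() reports the first case as a single "replace",
which Differ prints as a "- D" element followed by a "+ S" element.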

And the docs shed some light on what is going on:

| >>> help(difflib.Differ().compare)
Help on method compare in module difflib:

compare(self, a, b) method of difflib.Differ instance
    Compare two sequences of lines; generate the resulting delta.

    Each sequence must contain individual single-line strings ending with
    newlines. Such sequences can be obtained from the `readlines()` method
    of file-like objects.  The delta generated also consists of newline-
    terminated strings, ready to be printed as-is via the writeline()
    method of a file-like object.

    Example:

    >>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
    ...                                'ore\ntree\nemu\n'.splitlines(1))),
    - one
    ?  ^
    + ore
    ?  ^
    - two
    - three
    ?  -
    + tree
    + emu
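
In other words, compare() wants sequences of lines, and a plain string
gets treated as a sequence of one-character "lines" with no trailing
newlines, which is why the joined output above looked squashed. A small
sketch of both usages, written for Python 3 (the variable names are just
for illustration):

```python
import difflib

# Intended usage: two lists of newline-terminated lines.
a_lines = "one\ntwo\nthree\n".splitlines(keepends=True)
b_lines = "ore\ntree\nemu\n".splitlines(keepends=True)
line_delta = list(difflib.Differ().compare(a_lines, b_lines))
print("".join(line_delta), end="")

# Passing strings: each character is compared as if it were a line.
char_delta = list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
print(char_delta)

# Joining with an explicit newline makes the character delta readable.
print("\n".join(char_delta))
```

Each element of the delta starts with a two-character code ("- ", "+ ",
"  " or "? "), so the elements only line up when each one ends up on its
own line.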

HTH,
John



