How fuzzy is get_close_matches() in difflib?
John Machin
sjmachin at lexicon.net
Fri Nov 17 16:45:30 EST 2006
On Nov 17, 7:19 pm, Steven D'Aprano <s... at REMOVEME.cybersource.com.au>
wrote:
[snip]
> You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":
>
> HIDEDCT1 -- John's "best match" target string
> HIDEDST1 -- difflib's "best match" target string
> HIDESCT1 -- source string
>
> John's best match matches in seven of eight positions, compared to six
> of eight for the difflib best match. Disregarding order, both have seven
> matching characters. That's a pretty slim difference between the two:
>
> >>> "".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
> '  H  I  D  E- D+ S  C  T  1'
> >>> "".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
> '  H  I  D  E- D  S+ C  T  1'
>
> I honestly don't know how to interpret those results :(
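For reference, the score that get_close_matches() ranks by is SequenceMatcher.ratio(), documented as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. A minimal sketch of how the two candidates above score against the source string (the variable names are illustrative only):

```python
import difflib

source = "HIDESCT1"
candidates = ["HIDEDCT1", "HIDEDST1"]

scores = {}
for candidate in candidates:
    # Candidate goes first and the word second, mirroring the argument
    # order get_close_matches() uses internally.
    scores[candidate] = difflib.SequenceMatcher(None, candidate, source).ratio()
    print(candidate, scores[candidate])

# Both candidates clear the default cutoff of 0.6, so both are returned.
ranked = difflib.get_close_matches(source, candidates)
print(ranked)
```

Both candidates have seven matched characters out of sixteen in total, so they tie on ratio(). How a tie is ordered appears to be an implementation detail rather than documented behaviour, so the order of the result should not be read as one candidate being "closer" than the other.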
Take 1, firing from the hip:
The formatting of that output is suboptimal. Better would be:
' H I D E -D +S C T 1'
interpreted as "delete D, insert S"
' H I D E -D S +C T 1'
interpreted as "delete D, insert C"
On reflection, the author of difflib is known not to be so silly,
so take 2:
| >>> list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
['  H', '  I', '  D', '  E', '- D', '+ S', '  C', '  T', '  1']
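If the two-character prefixes are still hard to read, the same edit script can be pulled out of SequenceMatcher.get_opcodes(), which sidesteps the formatting question. A rough sketch; describe_edits is a made-up helper name, not part of difflib:

```python
import difflib

def describe_edits(a, b):
    """Return readable edit steps that turn a into b (hypothetical helper)."""
    steps = []
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each opcode is (tag, i1, i2, j1, j2): a[i1:i2] relates to b[j1:j2].
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            steps.append("keep %r" % a[i1:i2])
        elif tag == "delete":
            steps.append("delete %r" % a[i1:i2])
        elif tag == "insert":
            steps.append("insert %r" % b[j1:j2])
        else:  # tag == "replace"
            steps.append("replace %r with %r" % (a[i1:i2], b[j1:j2]))
    return steps

for target in ("HIDEDCT1", "HIDEDST1"):
    print(target, describe_edits(target, "HIDESCT1"))
```

For the first pair this reports a single replacement of D by S; for the second it reports deleting the D and inserting a C after the S.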
And the docs shed some light on what is going on:
| >>> help(difflib.Differ().compare)
Help on method compare in module difflib:

compare(self, a, b) method of difflib.Differ instance
    Compare two sequences of lines; generate the resulting delta.

    Each sequence must contain individual single-line strings ending with
    newlines. Such sequences can be obtained from the `readlines()` method
    of file-like objects. The delta generated also consists of
    newline-terminated strings, ready to be printed as-is via the
    writeline() method of a file-like object.

    Example:

    >>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
    ...                                'ore\ntree\nemu\n'.splitlines(1))),
    - one
    ?  ^
    + ore
    ?  ^
    - two
    - three
    ?  -
    + tree
    + emu
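In other words, compare() wants sequences of lines, and a bare string is just a sequence of one-character "lines". A short sketch of both usages in current Python 3 syntax (the sample texts are taken from the docstring; the variable names are illustrative):

```python
import difflib
import sys

old_text = "one\ntwo\nthree\n"
new_text = "ore\ntree\nemu\n"

# keepends=True keeps the trailing newline on every line, as compare() expects.
old_lines = old_text.splitlines(keepends=True)
new_lines = new_text.splitlines(keepends=True)

# Every item already ends with a newline, so it can be written out as-is.
delta = list(difflib.Differ().compare(old_lines, new_lines))
sys.stdout.writelines(delta)

# With plain strings each character is one "line" with no newline of its
# own, so join with "\n" rather than "" to get one item per output line.
char_delta = list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
print("\n".join(char_delta))
```

Joining the character-level delta with "" is what produced the run-together output quoted earlier in the thread.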
HTH,
John