Fuzzy string matching?

Tim Peters tim_one at email.msn.com
Fri Aug 27 06:27:39 CEST 1999


[Al Christians]
> I've gotten good results with ad hoc algorithms using a longest
> common contiguous substring routine. ...

Note that there's a capable & optimized SequenceMatcher class in

    Tools/Scripts/ndiff.py

that does exactly that, also accepting an optional characterization of "junk"
sequence elements, and with methods to pump out a list of "how to change
sequence1 into sequence2" edit operations, or just return a "similarity ratio"
(a float in [0.0, 1.0], from "nothing in common" to "identical").  Any kind of
sequence is OK as input, so long as the elements are hashable and support a
__cmp__ method that can distinguish equal from not-equal (it doesn't have to
make sense of < or >).
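
For readers on a later Python: the class described above was subsequently
moved into the standard library as difflib.SequenceMatcher.  The following is
a minimal sketch of the features just listed, assuming the difflib version of
the class rather than the 1999 ndiff.py original (whose constructor details may
differ).  The sample strings and the is_junk helper are made up for
illustration.

```python
from difflib import SequenceMatcher

def is_junk(ch):
    # Hypothetical "junk" characterization: treat blanks as uninteresting.
    return ch == " "

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

matcher = SequenceMatcher(is_junk, a, b)

# Similarity ratio: a float in [0.0, 1.0].
similarity = matcher.ratio()

# Edit operations describing how to turn a into b.
# Each entry is (tag, i1, i2, j1, j2) with tag one of
# 'equal', 'replace', 'delete', 'insert'.
opcodes = matcher.get_opcodes()

def apply_opcodes(src, dst, ops):
    # Rebuild dst from src by replaying the edit operations.
    out = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            out.append(src[i1:i2])
        elif tag in ("replace", "insert"):
            out.append(dst[j1:j2])
        # 'delete' contributes nothing to the output
    return "".join(out)

rebuilt = apply_opcodes(a, b, opcodes)

# The longest common contiguous block, as (a_start, b_start, size).
longest = matcher.find_longest_match(0, len(a), 0, len(b))

# Any sequence of hashable elements works, not just strings.
words_ratio = SequenceMatcher(None, ["a", "b"], ["a", "c"]).ratio()
```

Replaying the opcodes against a should reproduce b exactly, which is a handy
sanity check when experimenting with different junk functions.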

> ...
> How to combine these lengths into a scalar measure of match is
> the really ad-hoc part of it.

I defy you to prove that SequenceMatcher's combination is wrong <wink>.
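
For the curious, the combination in question is simple.  In the difflib
descendant of this class, ratio() is documented as 2.0*M/T, where M is the
number of matched elements and T is the total number of elements in both
sequences.  Whether the 1999 ndiff.py computed exactly the same thing is an
assumption here.  A sketch that recomputes it by hand (ratio_by_hand is a
hypothetical helper name, not part of any library):

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    # Sum the sizes of all matching blocks (the final sentinel block has size 0).
    matcher = SequenceMatcher(None, a, b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty sequences are treated as identical.
    return 2.0 * matched / total if total else 1.0

by_hand = ratio_by_hand("abcd", "bcde")
built_in = SequenceMatcher(None, "abcd", "bcde").ratio()
```

Here "bcd" is the only matching block, so M is 3 and T is 8, and both values
come out to 0.75.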

flamingly y'rs  - tim