Fuzzy string matching?
tim_one at email.msn.com
Fri Aug 27 06:27:39 CEST 1999
> I've gotten good results with ad hoc algorithms using a longest
> common contiguous substring routine. ...
Note that there's a capable & optimized SequenceMatcher class in
that does exactly that, also accepting an optional characterization of "junk"
sequence elements, and with methods to pump out a list of "how to change
sequence1 into sequence2" edit operations, or just return a "similarity ratio"
(a float in [0.0, 1.0], from "nothing in common" to "identical"). Any kind of
sequence is OK as input, so long as the elements are hashable and support a
__cmp__ method that can distinguish equal from not-equal (it doesn't have to
make sense of < or >).
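
For readers who want to try this, here is a minimal sketch. It assumes the class described above corresponds to `difflib.SequenceMatcher` in today's standard library; the post itself does not say where the class lived at the time. The sample strings and variable names are made up for illustration.

```python
from difflib import SequenceMatcher

a = "fuzzy string matching"
b = "fuzzy strong matching"

# First argument is the optional "junk" predicate; None means no element is junk.
m = SequenceMatcher(None, a, b)

# Similarity ratio: a float in [0.0, 1.0].
ratio = m.ratio()

# Longest common contiguous block between a and b.
block = m.find_longest_match(0, len(a), 0, len(b))
longest = a[block.a:block.a + block.size]

# "How to change a into b" edit operations, as (tag, i1, i2, j1, j2) tuples.
ops = m.get_opcodes()
for tag, i1, i2, j1, j2 in ops:
    print(tag, repr(a[i1:i2]), "->", repr(b[j1:j2]))

# Any sequence of hashable elements works, e.g. lists of words.
a_words = "the quick brown fox".split()
b_words = "the quick red fox".split()
word_ratio = SequenceMatcher(None, a_words, b_words).ratio()
```

Passing a one-argument function instead of `None` (for example, one that returns true for blanks) is how the "junk" characterization is supplied in `difflib`.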
> How to combine these lengths into a scalar measure of match is
> the really ad-hoc part of it.
I defy you to prove that SequenceMatcher's combination is wrong <wink>.
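
As a rough illustration of one such combination: in the standard library's `difflib.SequenceMatcher` (assumed here to match the class under discussion), the ratio is documented as 2.0*M/T, where M is the total number of matched elements and T is the combined length of both sequences. The helper name `ratio_by_hand` is hypothetical.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    """Recompute the similarity ratio as 2.0 * M / T from the matching blocks."""
    m = SequenceMatcher(None, a, b)
    # M: sum of the lengths of all matching blocks (the final dummy block has size 0).
    matched = sum(block.size for block in m.get_matching_blocks())
    # T: total number of elements in both sequences.
    total = len(a) + len(b)
    # Two empty sequences are treated as identical.
    return 2.0 * matched / total if total else 1.0

x, y = "abcd", "bcde"
by_hand = ratio_by_hand(x, y)
built_in = SequenceMatcher(None, x, y).ratio()
```

Here the only match is "bcd", so M is 3, T is 8, and both values come out to 0.75.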
flamingly y'rs - tim