String Comparisons returning score

Tim Peters tim.one at home.com
Sat Sep 1 22:13:23 EDT 2001


[Clayton Brown - Emmie Osawa]
> Is there an approved standard library/function/algorithm for comparing
> two similar strings and returning a percentage match?

See the std difflib module in Python 2.1; the guts of that appeared in
earlier Python releases as part of the ndiff.py utility; it implements an
algorithm related to Ratcliff and Obershelp's "gestalt" pattern matching.
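
As a minimal sketch of getting a percentage out of difflib (the helper
name percent_match is made up for illustration, and the syntax assumes a
current Python 3 rather than 2.1): SequenceMatcher.ratio() returns a float
in [0, 1], computed as 2.0*M/T, where M is the number of matching elements
and T is the combined length of both sequences.

```python
import difflib

def percent_match(a, b):
    """Return a 0-100 similarity score for two strings (hypothetical helper)."""
    # The first argument is the optional "junk" predicate; None disables it.
    matcher = difflib.SequenceMatcher(None, a, b)
    # ratio() is 2.0*M/T: M matched characters, T total characters in a and b.
    return 100.0 * matcher.ratio()

# "bcd" is the longest common block: M = 3, T = 8, so 2*3/8 = 0.75.
print(percent_match("abcd", "bcde"))
```

difflib.get_close_matches() builds on the same ratio if the goal is to pick
the best candidates from a list rather than score a single pair.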

> I am aware of soundEx.py / .c  which is based on the grammar and
> phonetics of words, but from what I have read it seems to be flawed..
> and thus removed from the python standard library.

It was removed more because Soundex isn't well-defined (even Knuth's
definition changed between editions 2 and 3 of TAoCP volume 3), and it was a
PITA to keep arguing about which was "the right" version.  The version we
had didn't correspond to any known published version anyway.  In any case,
Soundex was specifically designed to help match Anglo and some West European
surnames, and uses beyond that were always ill-advised.

> I have noticed similar techniques in other languages which are based
> on shift matrixes, working out the minimum number of changes to
> transform string A into string B.

There are dozens of possibilities.
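
One of the better-known members of that family is Levenshtein edit
distance, which is not what difflib computes.  A rough sketch, assuming
unit cost for each insertion, deletion and substitution (the function
names and the way the distance is turned into a percentage are
illustrative choices, not a standard):

```python
def edit_distance(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Scale the distance to 0-100 by the longer length (one possible choice)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - edit_distance(a, b) / longest)

print(edit_distance("kitten", "sitting"))
```

This keeps only two rows of the matrix in memory, so it needs
O(len(a) * len(b)) time but only O(len(b)) space.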

> I am more looking for one which looks at
> words/
> chars/
> char-order/
> length/
> similarity
> perhaps omitting spaces, and a common library (the,a,and,mr,mrs......)
> with a weighted scoring mechanism...

In that case, there are thousands of possibilities <0.5 wink>.
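
Purely to illustrate one of those thousands: the sketch below drops a
small stop-word list, then blends a word-level and a character-level
difflib ratio.  The stop words come from the examples in the question;
the function names, the punctuation handling and the 50/50 weights are
arbitrary assumptions, not anything difflib prescribes.

```python
import difflib

# Hypothetical stop-word list, taken from the examples in the question.
STOP_WORDS = {"the", "a", "and", "mr", "mrs"}

def normalize(text):
    """Lowercase, split on whitespace, trim '.' and ',', drop stop words."""
    words = [w.strip(".,") for w in text.lower().split()]
    return [w for w in words if w and w not in STOP_WORDS]

def weighted_score(a, b, word_weight=0.5, char_weight=0.5):
    """Blend word-level and character-level similarity into a 0-100 score."""
    wa, wb = normalize(a), normalize(b)
    # Word level: compare the lists of words, so word order matters.
    word_ratio = difflib.SequenceMatcher(None, wa, wb).ratio()
    # Character level: join without spaces, so spacing is ignored.
    char_ratio = difflib.SequenceMatcher(None, "".join(wa), "".join(wb)).ratio()
    return 100.0 * (word_weight * word_ratio + char_weight * char_ratio)

print(weighted_score("Mr. John Smith", "john smith"))
```

Changing the weights, the stop-word list or the underlying measure gives
a different and equally defensible score, which is rather the point.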

difflib-offers-one-ly y'rs  - tim




