Fuzzy string matching?
Magnus L. Hetland
mlh at idt.ntnu.no
Thu Aug 26 21:31:09 CEST 1999
mlh at idt.ntnu.no (Magnus L. Hetland) writes:
> "Hans Nowak" <hnowak at cuci.nl> writes:
>
> > ...Anyone know of such a beast in Python?
[...]
> Other than that - fuzzy matching is a non-trivial problem. It is
> possible to use dynamic programming to make a quadratic time distance
> function that you can use to compare words or strings - to see how
> similar they are. (If you like, I can post a Python variant of it.)
I decided not to wait for your answer... :)
def distance(a,b):
c = {}
n = len(a); m = len(b)
for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j
for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]
This calculates the minimal number of "operations" - where the
operations are adding, deleting, or translating a character - that are
needed to get from the string a to the string b. This, then, can be
used as a distance measure...
For instance
distance("big card of doom","bigg chard ov duum") == 5
distance("big card of doom","little card of phlegm") == 10
You could combine this with other operations, like making everything
lowercase, splitting into separate words and trying to look them up
(perhaps fuzzily) in a word-list - perhaps you could even do a
distance measure on the list-of-words in the same way...
I guess the simplest way to use it would be to just test it against
*all* the cards you have stored, and selecting the ones that got the
lowest distance as likely candidates... There might be some more
efficient algorithm, though...
(And if anyone wants an explanation as to how this distance function
really works, I'll be glad to supply it... I've done lectures about it
in an algorithm class - and will be doing so later this fall as well...)
>
> >
> > TIA,
--
Magnus Making no sound / Yet smouldering with passion
Lie The firefly is still sadder / Than the moaning insect
Hetland : Minamoto Shigeyuki
More information about the Python-list
mailing list