Jarow-Winkler algorithm: Measuring similarity between strings
John Machin
sjmachin at lexicon.net
Sat Dec 20 01:38:25 CET 2008
On Dec 20, 10:02 am, Øyvind <oyvin... at gmail.com> wrote:
> Based on examples and formulas fromhttp://en.wikipedia.org/wiki/Jaro-Winkler.
For another Python implementation, google "febrl".
> Useful for measuring similarity between two strings. For example if
> you want to detect that the user did a typo.
You mean like comparing the user's input word with some collection of
valid words? You would need to be using something else as a quick-and-
dirty filter ... Jaro-Winkler is relatively slow.
>
> def jarow(s1,s2):
>
> """ Returns a number between 1 and 0, where 1 is the most similar
>
> example:
>
> print jarow("martha","marhta")
>
> """
> m= jarow_m(s1,s2)
> t1 = jarow_t(s1,s2)
> t2 = jarow_t(s2,s1)
> t = float(t1)/float(t2)
Huh? t1 and t2 are supposed to be counts of transpositions. So is t.
So how come t is a ratio of t1 to t2?? BTW, suppose t2 is zero.
One usually prefers symmetry i.e dist(s1, s2) == dist(s2, s1). You
can't have symmetry if t = t1/t2. Also as the Wikipedia article says,
it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
dist(b, c).
>
> d = 0.1
>
> # this is the jaro-distance
> d_j = 1.0/3.0 * ((m/len(s1)) + (m/len(s2)) + ((m - t)/float(m)))
>
> # if the strings are prefixed similar, they are weighted more
> heavily
> l = winkler_l(s1,s2)
> print l
> return d_j + (l * 0.1 * (1 - d_j))
>
> def winkler_l(s1,s2):
> """ Number of the four first characters matching """
>
> l = 0
> counter = 0
> for s_j,s_i in zip(s1,s2):
>
> if s_j == s_i:
>
> l += 1
> counter += 1
>
> if counter > 4:
> break
>
> return l
>
> def jarow_m(s1,s2):
>
> """ Number of matching characters """
This code ignores the caveat from the Wikipedia article
"""
Two characters from s1 and s2 respectively, are considered matching
only if they are not farther than \left\lfloor\frac{\max(|s_1|,|s_2|)}
{2}\right\rfloor-1.
"""
which looks as though it is missing the words "characters apart" or
some such from the end of it. FWIW I've never seen this "distance
apart" restriction expressed unambiguously in the case where the
strings are of unequal length.
> m = 0
> d = {}
> for s in s1:
>
> d[s] = True
>
> for s in s2:
>
> if d.has_key(s):
>
> m += 1
> return m
The above code is not symmetrical; jarow_m(s1, s2) does not
necessarily equal jarow_m(s2, s1). The article talks about one "m",
not two of them.
> def jarow_t(s1,s2):
>
> """
> Number of transpositions
>
> """
>
> t= 0
> pos ={}
> counter = 0
> for s in s1:
>
> pos[s] = counter
> counter += 1
> counter = 0
> for s in s2:
>
> if pos.has_key(s):
>
> if pos[s] != counter:
This test is likely to come up with the wrong answer if the string
lengths differ.
>
> t += 1
>
> counter += 1
>
> return t
Cheers,
John
More information about the Python-list
mailing list