Jarow-Winkler algorithm: Measuring similarity between strings
Øyvind
oyvinrog at gmail.com
Sat Dec 20 09:07:31 CET 2008
Thanks for the useful comments.
On 20 Des, 01:38, John Machin <sjmac... at lexicon.net> wrote:
> On Dec 20, 10:02 am, Øyvind <oyvin... at gmail.com> wrote:
>
> > Based on examples and formulas fromhttp://en.wikipedia.org/wiki/Jaro-Winkler.
>
> For another Python implementation, google "febrl".
>
> > Useful for measuring similarity between two strings. For example if
> > you want to detect that the user did a typo.
>
> You mean like comparing the user's input word with some collection of
> valid words? You would need to be using something else as a quick-and-
> dirty filter ... Jaro-Winkler is relatively slow.
Do you have any alternative suggestions?
>
>
>
> > def jarow(s1,s2):
>
> > """ Returns a number between 1 and 0, where 1 is the most similar
>
> > example:
>
> > print jarow("martha","marhta")
>
> > """
> > m= jarow_m(s1,s2)
> > t1 = jarow_t(s1,s2)
> > t2 = jarow_t(s2,s1)
> > t = float(t1)/float(t2)
>
> Huh? t1 and t2 are supposed to be counts of transpositions. So is t.
> So how come t is a ratio of t1 to t2?? BTW, suppose t2 is zero.
Good point. There should be only one jarow_t.
>
> Also as the Wikipedia article says,
> it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
> dist(b, c).
Its not a mathematical correct metric, but it is a useful heuristical
metric.
>
> The above code is not symmetrical; jarow_m(s1, s2) does not
> necessarily equal jarow_m(s2, s1). The article talks about one "m",
> not two of them.
>
Hmm.. also a good point. I will make it count all the matches, not
just the matches in s1.
More information about the Python-list
mailing list