Percentage matching of text

Tim Peters tim.peters at gmail.com
Fri Jul 30 22:07:18 CEST 2004


[Bruce Eckel]
> Ah, but here's an interesting one:

Not really <wink>.  Do read the module docstrings and comments.  It's
all explained there.

> >>> sm(None, 'abcd', 'abdc').ratio()
> 0.75
>
> >>> sm(None, 'abcd', 'abxx').ratio()
> 0.5
> 
> So if it matches half the string it's 50% but if the last two
> characters are out of order that's an additional 25%.

I believe if you asked any human "cold" which of those two examples
uses strings that were "more similar", they'd agree that the first
pair is more similar.

> Other examples:
> 
> >>> sm(None, 'abcd10', 'abdc20').quick_ratio()
> 0.83333333333333337
> >>> sm(None, 'abcd10', 'abdc20').ratio()
> 0.66666666666666663
> >>> sm(None, 'abcd10', 'abdc20').real_quick_ratio()
> 1.0
> 
> You get a different interpretation for each "speed" of ratio.

As the docs say,

    def quick_ratio(self):
        """Return an upper bound on ratio() relatively quickly.

        This isn't defined beyond that it is an upper bound on .ratio(), and
        is faster to compute.
        """
...

    def real_quick_ratio(self):
        """Return an upper bound on ratio() very quickly.

        This isn't defined beyond that it is an upper bound on .ratio(), and
        is faster to compute than either .ratio() or .quick_ratio().
        """

Diffs of long sequences can be expensive to compute, and these guys
are speed hacks normally used in the stylized way
difflib.get_close_matches() uses them:

        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:

That way the potentially expensive s.ratio() isn't called if the
cheap-ass upper bounds are already so poor that there's no chance of
meeting the cutoff.

> I started thinking that all I wanted was a pass-fail kind of thing so
> I wondered if real_quick_ratio() might do the trick.

No, nothing is defined about what real_quick_ratio() computes, beyond
that it's an upper bound on .ratio().  It would be fine if it always
returned 1.0, regardless of input.  It wouldn't be *useful* then, but
it would be fine <wink>.  What's wrong with using .ratio() directly? 
The speed-hack upper-bound methods can pay hugely in speeding MxN
cross-product kinds of comparison tasks.

...

> It seems like there's no way to get real_quick_ratio() to say anything
> except "it's a perfect match!"

Not so, but you can read the code to learn how.  What it does is about
the cheapest non-constant thing imaginable (it doesn't even look at
the elements of the sequences!), but it's still valuable for speeding
cross-product similarity applications.



More information about the Python-list mailing list