# Percentage matching of text

Tim Peters tim.peters at gmail.com
Fri Jul 30 22:07:18 CEST 2004

```[Bruce Eckel]
> Ah, but here's an interesting one:

all explained there.

> >>> sm(None, 'abcd', 'abdc').ratio()
> 0.75
>
> >>> sm(None, 'abcd', 'abxx').ratio()
> 0.5
>
> So if it matches half the string it's 50% but if the last two
> characters are out of order that's an additional 25%.

I believe if you asked any human "cold" which of those two examples
uses strings that were "more similar", they'd agree that the first
pair is more similar.

> Other examples:
>
> >>> sm(None, 'abcd10', 'abdc20').quick_ratio()
> 0.83333333333333337
> >>> sm(None, 'abcd10', 'abdc20').ratio()
> 0.66666666666666663
> >>> sm(None, 'abcd10', 'abdc20').real_quick_ratio()
> 1.0
>
> You get a different interpretation for each "speed" of ratio.

As the docs say,

def quick_ratio(self):
"""Return an upper bound on ratio() relatively quickly.

This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute.
"""
...

def real_quick_ratio(self):
"""Return an upper bound on ratio() very quickly.

This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute than either .ratio() or .quick_ratio().
"""

Diffs of long sequences can be expensive to compute, and these guys
are speed hacks normally used in the stylized way
difflib.get_close_matches() uses them:

if s.real_quick_ratio() >= cutoff and \
s.quick_ratio() >= cutoff and \
s.ratio() >= cutoff:

That way the potentially expensive s.ratio() isn't called if the
cheap-ass upper bounds are already so poor that there's no chance of
meeting the cutoff.

> I started thinking that all I wanted was a pass-fail kind of thing so
> I wondered if real_quick_ratio() might do the trick.

No, nothing is defined about what real_quick_ratio() computes, beyond
that it's an upper bound on .ratio().  It would be fine if it always
returned 1.0, regardless of input.  It wouldn't be *useful* then, but
it would be fine <wink>.  What's wrong with using .ratio() directly?
The speed-hack upper-bound methods can pay hugely in speeding MxN

...

> It seems like there's no way to get real_quick_ratio() to say anything
> except "it's a perfect match!"

Not so, but you can read the code to learn how.  What it does is about
the cheapest non-constant thing imaginable (it doesn't even look at
the elements of the sequences!), but it's still valuable for speeding
cross-product similarity applications.

```