[Python-ideas] difflib.SequenceMatcher quick_ratio

Mon Jun 8 15:31:34 CEST 2015

If this really is needed as a performance optimization, surely you want to do something faster than loop over dozens of comparisons to decide whether you can skip the actual work?

I don't know whether this can be calculated analytically. If it can't, you're presumably running this on zillions of lines, so instead of repeating the loop on every call, wouldn't it be better to run it once per threshold and then just check the length ratio each time? (You could hide that from the caller by factoring the loop out into a function, say _get_ratio_for_threshold, and decorating it with @lru_cache. But I don't know if you really need to hide it from the caller.)
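
For reference, here is a minimal sketch of what a length-only check could look like. It assumes the bound in question is the one difflib's own real_quick_ratio() computes: the ratio is 2*M/T (M matches, T combined length), and M can never exceed the shorter length. The names length_upper_bound and quick_ratio_ge are hypothetical, and this is not the patch attached to the issue:

```python
import difflib


def length_upper_bound(a, b):
    """Upper bound on ratio()/quick_ratio() using only the two lengths.

    Hypothetical helper; same bound that real_quick_ratio() returns.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # difflib scores two empty sequences as 1.0
    return 2.0 * min(len(a), len(b)) / total


def quick_ratio_ge(a, b, threshold):
    """True if quick_ratio() of a and b is >= threshold (hypothetical name)."""
    if length_upper_bound(a, b) < threshold:
        # The lengths alone rule this pair out, so skip building the matcher.
        return False
    return difflib.SequenceMatcher(None, a, b).quick_ratio() >= threshold
```

If the bound really is this simple, it is a single division per call, so there would be nothing worth caching and no loop to factor out.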

Also, do the extra checks for 0, 1, and 0.1 and for empty strings actually speed things up in practice?
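
One way to answer that is to time both variants on representative data. A rough timeit harness, as a sketch only: plain, with_length_check and compare are hypothetical names, and the length check is my assumption of what the fast path does, not the code from the issue:

```python
import difflib
import timeit


def plain(a, b, threshold):
    # Baseline: always build the matcher and count characters.
    return difflib.SequenceMatcher(None, a, b).quick_ratio() >= threshold


def with_length_check(a, b, threshold):
    # Assumed fast path: reject on lengths before building the matcher.
    total = len(a) + len(b)
    if total and 2.0 * min(len(a), len(b)) / total < threshold:
        return False
    return plain(a, b, threshold)


def compare(pairs, threshold, number=200):
    """Return total seconds per variant for `number` passes over `pairs`."""
    results = {}
    for func in (plain, with_length_check):
        results[func.__name__] = timeit.timeit(
            lambda: [func(a, b, threshold) for a, b in pairs],
            number=number,
        )
    return results
```

Calling compare() with pairs drawn from the real workload and the real threshold shows whether the shortcut pays off. Note that constructing a SequenceMatcher already indexes b, so any saving comes mostly from skipping that construction, and it depends on how often the lengths alone rule a pair out.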

> On Jun 8, 2015, at 00:56, floyd <floyd at floyd.ch> wrote:
> 
> Hi *
> 
> I use this python line quite a lot in some projects:
> 
> if difflib.SequenceMatcher(None, a, b).quick_ratio() >= threshold:
> 
> I realized that this is not optimal performance-wise, so I wrote a
> method that returns much faster in a lot of cases by using the
> lengths of "a" and "b" to calculate an upper bound on the ratio and
> comparing that bound against "threshold":
> 
> if difflib.SequenceMatcher(None, a, b).quick_ratio_ge(threshold):
> 
> I'd say we could include it into the stdlib, but maybe it should only be
> a python code recipe?
> 
> I would say this is one of the most frequent use cases for difflib, but
> maybe that's just my biased opinion :) . What's yours?
> 
> See http://bugs.python.org/issue24384
> 
> cheers,
> floyd