
Hi *,

I use this Python line quite a lot in some projects:

    if difflib.SequenceMatcher.quick_ratio(None, a, b) >= threshold:

I realized that this is not optimal performance-wise, so I wrote a method that returns much faster in a lot of cases by using the lengths of "a" and "b" to calculate the upper bound for "threshold":

    if difflib.SequenceMatcher.quick_ratio_ge(None, a, b, threshold):

I'd say we could include it in the stdlib, but maybe it should only be a Python code recipe? I would say this is one of the most frequent use cases for difflib, but maybe that's just my biased opinion :). What's yours?

See http://bugs.python.org/issue24384

cheers, floyd
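A minimal sketch of the length-based early exit being proposed, written as a standalone function rather than in the unbound-method style above; real_quick_ratio() is the existing difflib method that computes an upper bound on the ratio from the lengths alone:

    import difflib

    def quick_ratio_ge(a, b, threshold):
        # Sketch only: real_quick_ratio() depends just on len(a) and len(b)
        # and is an upper bound on quick_ratio(), so if it already falls
        # below the threshold the more expensive check can be skipped.
        sm = difflib.SequenceMatcher(None, a, b)
        if sm.real_quick_ratio() < threshold:
            return False
        return sm.quick_ratio() >= threshold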

On Mon, Jun 8, 2015 at 11:44 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
If such a function is added, I think it needs a better name. E.g. difflib.isclose(a, b, threshold).
Indeed, this is somewhat similar in concept to the recently-added math.isclose() function, and could use a similar signature, i.e.: difflib.SequenceMatcher.isclose(a, b, rel_tol=None, abs_tol=None). However, the real issue here is whether this is important enough to be included in the stdlib. Are there any places in the stdlib where this would be useful? Can anyone other than the OP confirm that they would find having this in the stdlib particularly useful? Why should this be in the stdlib vs. a recipe? - Tal Einat
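For comparison, the simpler spelling suggested above could be written as a short recipe; using ratio() rather than quick_ratio() here is only an assumption, not part of either proposal:

    from difflib import SequenceMatcher

    def isclose(a, b, threshold):
        # Recipe-style sketch of the suggested difflib.isclose(a, b, threshold);
        # the choice of ratio() over quick_ratio() is an assumption.
        return SequenceMatcher(None, a, b).ratio() >= threshold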

One thing I found is that the stdlib's attempt to be smart (human-friendly diff) makes it horribly slow[1].
SequenceMatcher tries to compute a "human-friendly diff" between two sequences.
If you are only interested in a quick ratio, especially on long sequences, I would suggest using another algorithm that does not have an n^3 worst case. On some sequences, computing the diff was several orders of magnitude faster for me[2] (pure Python), to the point where quick_ratio was not needed. Note also that SequenceMatcher(None, a, b) might not give the same result/ratio as SequenceMatcher(None, b, a):
    >>> SequenceMatcher(None, 'aba', 'bca').get_matching_blocks()
    [Match(a=0, b=2, size=1), Match(a=3, b=3, size=0)]                           # 1 common char
    >>> SequenceMatcher(None, 'bca', 'aba').get_matching_blocks()
    [Match(a=0, b=1, size=1), Match(a=2, b=2, size=1), Match(a=3, b=3, size=0)]  # 2 common chars
— M

[1] For my application, I don’t really care about having a human-friendly diff, but the actual minimal diff. I do understand the need for this algorithm, though.
[2] Well-chosen benchmark: http://i.imgur.com/cZPwR0H.png
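Given the matching blocks shown above, the asymmetry also shows up directly in the ratios, since ratio() is 2*M/T, where M is the total size of the matching blocks and T is the combined length of both sequences:

    from difflib import SequenceMatcher

    # With the blocks above, swapping the arguments changes M from 1 to 2:
    SequenceMatcher(None, 'aba', 'bca').ratio()   # 2*1/6 ≈ 0.33
    SequenceMatcher(None, 'bca', 'aba').ratio()   # 2*2/6 ≈ 0.67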

If this really is needed as a performance optimization, surely you want to do something faster than loop over dozens of comparisons to decide whether you can skip the actual work? I don't know if this is something you can calculate analytically, but if not, you're presumably doing this on zillions of lines, and instead of repeating the loop every time, wouldn't it be better to just do it once and then just check the ratio each time? (You could hide that from the caller by just factoring out the loop to a function _get_ratio_for_threshold and decorating it with @lru_cache. But I don't know if you really need to hide it from the caller.) Also, do the extra checks for 0, 1, and 0.1 and for empty strings actually speed things up in practice?
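One way to read that caching suggestion as code; the helper name comes from the message above, while the body (caching the quick ratio per pair of strings) is an assumption about what actually gets cached:

    from difflib import SequenceMatcher
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def _get_ratio_for_threshold(a, b):
        # Cached per (a, b) pair, so repeated threshold checks against the
        # same strings only pay for quick_ratio() once.
        return SequenceMatcher(None, a, b).quick_ratio()

    def quick_ratio_ge(a, b, threshold):
        return _get_ratio_for_threshold(a, b) >= threshold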

participants (5)
- Andrew Barnert
- floyd
- Matthias Bussonnier
- Serhiy Storchaka
- Tal Einat