
Hi *,

I use this Python line quite a lot in some projects:

    if difflib.SequenceMatcher.quick_ratio(None, a, b) >= threshold:

I realized that this is not optimal performance-wise, so I wrote a method that returns much faster in a lot of cases by using the lengths of "a" and "b" to calculate the upper bound for "threshold":

    if difflib.SequenceMatcher.quick_ratio_ge(None, a, b, threshold):

I'd say we could include it in the stdlib, but maybe it should only be a Python code recipe? I would say this is one of the most frequent use cases for difflib, but maybe that's just my biased opinion :). What's yours?

See http://bugs.python.org/issue24384

cheers, floyd
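A minimal sketch of the length-based early exit being proposed, written as a standalone function rather than in the unbound-method style above; real_quick_ratio() is the existing difflib method that computes an upper bound on the ratio from the lengths alone:

    import difflib

    def quick_ratio_ge(a, b, threshold):
        # Sketch only: real_quick_ratio() depends just on len(a) and len(b)
        # and is an upper bound on quick_ratio(), so if it already falls
        # below the threshold the more expensive check can be skipped.
        sm = difflib.SequenceMatcher(None, a, b)
        if sm.real_quick_ratio() < threshold:
            return False
        return sm.quick_ratio() >= threshold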

On Mon, Jun 8, 2015 at 11:44 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
If such a function is added, I think it needs a better name. E.g. difflib.isclose(a, b, threshold).
Indeed, this is somewhat similar in concept to the recently-added math.isclose() function, and could use a similar signature, i.e.: difflib.SequenceMatcher.isclose(a, b, rel_tol=None, abs_tol=None). However, the real issue here is whether this is important enough to be included in the stdlib. Are there any places in the stdlib where this would be useful? Can anyone other than the OP confirm that they would find having this in the stdlib particularly useful? Why should this be in the stdlib vs. a recipe? - Tal Einat
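For comparison, the simpler spelling suggested above could be written as a short recipe; using ratio() rather than quick_ratio() here is only an assumption, not part of either proposal:

    from difflib import SequenceMatcher

    def isclose(a, b, threshold):
        # Recipe-style sketch of the suggested difflib.isclose(a, b, threshold);
        # the choice of ratio() over quick_ratio() is an assumption.
        return SequenceMatcher(None, a, b).ratio() >= threshold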

One thing I found is that the stdlib's attempt to be smart (human-friendly diff) makes it horribly slow[1].
SequenceMatcher tries to compute a "human-friendly diff" between two sequences.
If you are only interested in a quick ratio, especially on long sequences, I would suggest using another algorithm that does not have an n^3 worst case. On some sequences, computing the diff was several orders of magnitude faster for me[2] (pure Python), to the point where quick_ratio was not needed. Note also that SequenceMatcher(None, a, b) might not give the same result/ratio as SequenceMatcher(None, b, a):
    >>> SequenceMatcher(None, 'aba', 'bca').get_matching_blocks()
    [Match(a=0, b=2, size=1), Match(a=3, b=3, size=0)]                           # 1 common char
    >>> SequenceMatcher(None, 'bca', 'aba').get_matching_blocks()
    [Match(a=0, b=1, size=1), Match(a=2, b=2, size=1), Match(a=3, b=3, size=0)]  # 2 common chars
— M

[1] For my application, I don’t really care about having a human-friendly diff, but the actual minimal diff. I do understand the need for this algorithm, though.
[2] Well-chosen benchmark: http://i.imgur.com/cZPwR0H.png
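Given the matching blocks shown above, the asymmetry also shows up directly in the ratios, since ratio() is 2*M/T, where M is the total size of the matching blocks and T is the combined length of both sequences:

    from difflib import SequenceMatcher

    # With the blocks above, swapping the arguments changes M from 1 to 2:
    SequenceMatcher(None, 'aba', 'bca').ratio()   # 2*1/6 ≈ 0.33
    SequenceMatcher(None, 'bca', 'aba').ratio()   # 2*2/6 ≈ 0.67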

If this really is needed as a performance optimization, surely you want to do something faster than loop over dozens of comparisons to decide whether you can skip the actual work? I don't know if this is something you can calculate analytically, but if not, you're presumably doing this on zillions of lines, and instead of repeating the loop every time, wouldn't it be better to just do it once and then just check the ratio each time? (You could hide that from the caller by just factoring out the loop to a function _get_ratio_for_threshold and decorating it with @lru_cache. But I don't know if you really need to hide it from the caller.) Also, do the extra checks for 0, 1, and 0.1 and for empty strings actually speed things up in practice?
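One way to read that caching suggestion as code; the helper name comes from the message above, while the body (caching the quick ratio per pair of strings) is an assumption about what actually gets cached:

    from difflib import SequenceMatcher
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def _get_ratio_for_threshold(a, b):
        # Cached per (a, b) pair, so repeated threshold checks against the
        # same strings only pay for quick_ratio() once.
        return SequenceMatcher(None, a, b).quick_ratio()

    def quick_ratio_ge(a, b, threshold):
        return _get_ratio_for_threshold(a, b) >= threshold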

participants (5)
- Andrew Barnert
- floyd
- Matthias Bussonnier
- Serhiy Storchaka
- Tal Einat