[Python-ideas] difflib.SequenceMatcher quick_ratio

Matthias Bussonnier bussonniermatthias at gmail.com
Mon Jun 8 17:39:59 CEST 2015


> On Jun 8, 2015, at 08:15, Tal Einat <taleinat at gmail.com> wrote:
> 
> On Mon, Jun 8, 2015 at 11:44 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 
>> If such function will be added, I think it needs better name. E.g.
>> difflib.isclose(a, b, threshold).
> 
> Indeed, this is somewhat similar in concept to the recently-added
> math.isclose() function, and could use a similar signature, i.e.:
> difflib.SequenceMatcher.isclose(a, b, rel_tol=None, abs_tol=None).
> 
> However, the real issue here is whether this is important enough to be
> included in the stdlib.


One thing I found is that the fact that stdlib try to be smart (human friendly diff) make it horribly slow[1]. 

>  SequenceMatcher tries to compute a "human-friendly diff" between two sequences.

If you are only interested in a quick ratio, especially on long sequences, I would suggest using another algorithm
which is not worse case scenario in n^3.

On some sequences computing the diff was several order of magnitude faster for me[2] (pure python).
At the point where quick-ratio was not needed.

Note also that SequeceMatcher(None, a, b) might not give the same result/ratio that SequenceMatcher(None, b, a)

>>> SequenceMatcher(None,'aba', 'bca').get_matching_blocks()
 [Match(a=0, b=2, size=1), Match(a=3, b=3, size=0)]   # 1 common char

>>> SequenceMatcher(None,'bca','aba').get_matching_blocks()
 [Match(a=0, b=1, size=1), Match(a=2, b=2, size=1), Match(a=3, b=3, size=0)] # 2 common chars.

— 
M

[1] For my application, I don’t really care about having Human Friendly diff, but the actual minimal diff. 
     I do understand the need for this algorithm though. 
[2] Well chosen Benchmark : http://i.imgur.com/cZPwR0H.png <http://i.imgur.com/cZPwR0H.png>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20150608/9d8dc876/attachment.html>


More information about the Python-ideas mailing list