[Python-bugs-list] [ python-Bugs-645629 ] SequenceMatcher finds suboptimal sequenc

Fri, 29 Nov 2002 08:03:26 -0800

Bugs item #645629, was opened at 2002-11-29 05:54
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=645629&group_id=5470

Category: Python Library
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: David Necas (yeti-dn)
Assigned to: Tim Peters (tim_one)
Summary: SequenceMatcher finds suboptimal sequenc

Initial Comment:
The algorithm used for approximate string matching
doesn't find the optimal edit sequence (it finds
longest blocks instead).

Example:

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher()
>>> sm.set_seqs('axfot', 'aoftax')
>>> sm.ratio()
0.36363636363636365
>>> sm.get_matching_blocks()
[(0, 4, 2), (5, 6, 0)]
>>> sm.get_opcodes()
[('insert', 0, 0, 0, 4), ('equal', 0, 2, 4, 6),
('delete', 2, 5, 6, 6)]

What's wrong?

Levenshtein distance with weight 2 for item replacement
is only 5 (the weight 2 corresponds to what ratio() is
supposed to compute, the classic Levenshtein distance
is 4), so one would expect to get similarity (i.e.
ratio()) (11-5)/11 = 6/11 = 0.545454545454..., and not
only 4/11.

And really, the maximal matching blocks are:
[(0, 0, 1), (2, 2, 1), (4, 3, 1)]
and the minimal edit sequence is:
[('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2),
('equal', 2, 3, 2, 3), ('delete', 3, 4, 3, 3),
('equal', 4, 5, 3, 4), ('insert', 5, 5, 4, 6)]

The impact of this ``feature'' on diff-like
applications may be even positive, beause the edit
sequence then consists of smaller number of operations
on lager chunks.  Thus I'm not sure if this is
something which should be fixed.  However, it should be
at least noted in the documentation the ratio()
function gives only a lower bound of the string
similarity (so people like me won't be tempted to use
it to check results of their own Levenshtein
distance/string similarity implementation).

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2002-11-29 11:03

Message:
Logged In: YES 
user_id=31435

Please read the docs first.  This isn't the Levenshtein 
algorithm, and it doesn't care about finding minimal edit 
distance.

----------------------------------------------------------------------

Comment By: David Necas (yeti-dn)
Date: 2002-11-29 07:58

Message:
Logged In: YES 
user_id=658986

Sorry, I've changed my mind.  This definitely should be
fixed.  In following strings finding `Observation'
effectively inhibits finding the much better match:

sm.set_seqs('Observation: What seems as a small glitch at
the first sight may have large impact',
'What-seems-as-a-small-glitch-at-the-first-sight-may-have-large-impact
(Observation)')

Unfortunately this probably means complete rewrite, I can't
see how the current algorithm could be changed to work in
this case (but I don't understand it 100%, so maybe...).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=645629&group_id=5470