Ensuring symmetry in difflib.SequenceMatcher

John Machin sjmachin at lexicon.net
Wed Nov 24 23:01:06 CET 2010


On Nov 24, 8:43 pm, Peter Otten <__pete... at web.de> wrote:
> John Yeung wrote:
> > I'm generally pleased with difflib.SequenceMatcher:  It's probably not
> > the best available string matcher out there, but it's in the standard
> > library and I've seen worse in the wild.  One thing that kind of
> > bothers me is that it's sensitive to which argument you pick as "seq1"
> > and which you pick as "seq2":
>
> > Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
> > (Intel)] on
> > win32
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>> import difflib
> >>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
> > 0.44444444444444442
> >>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
> > 0.66666666666666663
>
> > Is this a bug?  I am guessing the algorithm is implemented correctly,
> > and that it's just an inherent property of the algorithm used.  It's
> > certainly not what I'd call a desirably property.  Are there any
> > simple adjustments that can be made without sacrificing (too much)
> > performance?
>
> def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
>     return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0
>
> I'm expecting 50% performance loss ;)
>
> Seriously, have you tried to calculate the ratio with realistic data?
> Without looking into the source I would expect the two ratios to get more
> similar.
>
> Peter

Surnames are extremely realistic data. The OP should consider using
Levenshtein distance, which is "symmetric". A good (non-naive)
implementation should be much faster than difflib.

ratio = 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))



More information about the Python-list mailing list