Ensuring symmetry in difflib.SequenceMatcher
Peter Otten
__peter__ at web.de
Wed Nov 24 04:43:41 EST 2010
John Yeung wrote:
> I'm generally pleased with difflib.SequenceMatcher: It's probably not
> the best available string matcher out there, but it's in the standard
> library and I've seen worse in the wild. One thing that kind of
> bothers me is that it's sensitive to which argument you pick as "seq1"
> and which you pick as "seq2":
>
> Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
> (Intel)] on
> win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import difflib
>>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
> 0.44444444444444442
>>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
> 0.66666666666666663
>>>>
>
> Is this a bug? I am guessing the algorithm is implemented correctly,
> and that it's just an inherent property of the algorithm used. It's
> certainly not what I'd call a desirably property. Are there any
> simple adjustments that can be made without sacrificing (too much)
> performance?
def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0
I'm expecting 50% performance loss ;)
Seriously, have you tried to calculate the ratio with realistic data?
Without looking into the source I would expect the two ratios to get more
similar.
Peter
More information about the Python-list
mailing list