Problem with difflib SequenceMatcher
Alain Ketterlin
alain at universite-de-strasbourg.fr.invalid
Mon Sep 12 08:18:29 EDT 2016
Jay <jay.sridhar at gmail.com> writes:
> I am having an odd problem with difflib.SequenceMatcher. Sample code below:
>
> The strings "src" and "trg" differ only a little.
How exactly? (Please be precise, it helps testing.)
> The SequenceMatcher.ratio() for these strings is 0.0. Many other similar
> strings are working fine without problems (see below) with non-zero
> ratios depending on how much difference there is between strings (as
> expected).
Calling SM(...,trg[1:],src[1:]) gives a plausible result. See also the
result of .get_matching_blocks() on your strings (it returns no matching
blocks).
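For anyone who wants to reproduce both observations, here is a minimal sketch. The strings are re-joined from the wrapped lines of the quoted program, on the assumption that every line break in the mail stood for a single space; the names "full" and "tail" are mine, not from the original code.

```python
from difflib import SequenceMatcher

# Assumption: each wrapped line of the quoted literals was separated by one space.
src = (u"N KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
       u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
       u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T A SSLST "
       u"ANTSTRL SST")
trg = (u"M KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
       u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
       u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T SSLST "
       u"ANTSTRL SST")

# Same call as in the quoted program (default settings).
full = SequenceMatcher(None, trg, src)
# Same call with the first character of each string dropped.
tail = SequenceMatcher(None, trg[1:], src[1:])

# The last entry of get_matching_blocks() is always a zero-length sentinel,
# so a list holding only that entry means nothing matched.
print(full.get_matching_blocks())
print(full.ratio())
print(tail.ratio())
```

With these inputs the first matcher should report nothing but the sentinel block, while the sliced strings get a non-zero ratio.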
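It is all due to the "autojunk" heuristic (see difflib's doc for
details): once a sequence has 200 items or more, any item that makes up
more than about 1% of it is treated as junk. Your strings are long
enough, and they consist of a handful of characters repeated many
times, so the characters end up being discarded. Call
SM(...,autojunk=False).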
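To see the heuristic in isolation, here is a small sketch on made-up data (the strings "a" and "b" below are hypothetical, not the ones from the post). It relies only on the documented autojunk keyword and the bpopular attribute of SequenceMatcher.

```python
from difflib import SequenceMatcher

# Hypothetical input: 240 characters drawn from a four-letter alphabet,
# so every character is far more frequent than the heuristic tolerates.
b = u"abcd" * 60
a = u"X" + b[1:]   # identical to b except for the first character

default = SequenceMatcher(None, a, b)                  # autojunk=True
no_junk = SequenceMatcher(None, a, b, autojunk=False)  # heuristic disabled

print(sorted(default.bpopular))   # items of b discarded as "popular"
print(default.ratio())
print(no_junk.ratio())
```

With the heuristic on, every character of b is dropped from the index and the two strings look unrelated; with autojunk=False the matcher sees that they differ by one character. The price is speed: the heuristic exists to keep long sequences fast, so turning it off on very large inputs can be slow.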
I have no idea why the maintainers made this stupid autojunk idea the
default. Complain to them.
-- Alain.
> Tested on Python 2.7 on Ubuntu 14.04
>
> Program follows:
> ---
> from difflib import SequenceMatcher as SM
>
> src = (u"N KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
>        u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
>        u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T A SSLST "
>        u"ANTSTRL SST")
> trg = (u"M KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
>        u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
>        u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T SSLST "
>        u"ANTSTRL SST")
> print src, '\n', trg, '\n', SM(None, trg, src).ratio()