The "junk" parameter in difflib.SequenceMatcher

Fri Oct 17 11:55:42 EDT 2003

[shuhsien]
> I am confused by the junk parameter in the difflib.SequenceMatcher. I
> thought it would simply ignore everything that's returned true by the
> junk function.

It ignores elements for which isjunk(element) returns true for the purpose
of finding a synch region.  It doesn't ignore such elements for the purpose
of computing sequence length, nor for computing the number of matches:

    The idea is to find the longest contiguous matching subsequence
    that contains no ``junk'' elements.
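
To see what "ignore" means in practice, here is a small sketch adapted from
the example in the difflib docs.  It asks for the longest match with and
without a blank-is-junk function; the variable names are mine.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# No junk: the longest match includes the leading blank.
plain = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))

# Blanks are junk: the longest junk-free match is "abcd".
nojunk = SequenceMatcher(lambda ch: ch == " ", a, b).find_longest_match(
    0, len(a), 0, len(b))

print(plain)   # Match(a=0, b=4, size=5)
print(nojunk)  # Match(a=1, b=0, size=4)
```

The junk function changes which region is picked as the match; it does not
change the sequences themselves.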

> >>> SequenceMatcher(lambda x: x == ' ', "lion", "li on").ratio()
> 0.88888888888888884

Using the symbols in the docs, this has M=4 and T=4+5=9, so the result is
2.*4/9 = 8./9.
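
The arithmetic can be checked by hand.  A minimal sketch, assuming M is the
total size of the matching blocks and T the combined length of both
sequences, as the docs define them (the names M and T below just mirror
those symbols):

```python
from difflib import SequenceMatcher

a, b = "lion", "li on"
sm = SequenceMatcher(lambda ch: ch == " ", a, b)

# M: total size of all matching blocks (the last block is a size-0 sentinel).
M = sum(block.size for block in sm.get_matching_blocks())
# T: total number of elements in both sequences; junk elements still count.
T = len(a) + len(b)

print(M, T)         # 4 9
print(2.0 * M / T)  # same value as sm.ratio()
```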

> >>> SequenceMatcher("lion", "li on").ratio()
> 0.0

That call doesn't make much sense.  It's passing "lion" as the isjunk
function, and comparing "li on" to the empty string.  It doesn't raise an
exception because the comparison is so trivial (an empty string has no
characters in common with any other string) it never even tries to call the
(bogus, in this case) isjunk function.
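
A quick way to confirm how the arguments got bound is to look at the
matcher's a and b attributes.  The sketch below also shows one way to avoid
the mistake, by passing keyword arguments; the names bogus and ok are just
for illustration.

```python
from difflib import SequenceMatcher

# Positional arguments bind as (isjunk, a, b), so this sets isjunk="lion",
# a="li on", and leaves b at its default, the empty string.
bogus = SequenceMatcher("lion", "li on")
print(repr(bogus.a), repr(bogus.b))  # 'li on' ''
print(bogus.ratio())                 # 0.0

# Keyword arguments make the intent explicit; isjunk=None means
# "nothing is junk".
ok = SequenceMatcher(isjunk=None, a="lion", b="li on")
print(ok.ratio())                    # 8/9, about 0.889
```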

> >>> SequenceMatcher(lambda x: x == ' ', "lion", " lion ").ratio()
> 0.80000000000000004

M=4, T=4+6=10, and 2.*4/10 = 0.8.

> It's not ignoring the blanks,

It refuses to match on blanks for the purpose of finding the longest
junk-free contiguous matching substring; that's all ignoring means.
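
One way to see that the blanks are still there is to ask for the opcodes.
A sketch using only the standard get_opcodes() method, on the same pair of
strings as the " lion " example:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(lambda ch: ch == " ", "lion", " lion ")

# Each opcode is (tag, i1, i2, j1, j2): how to turn a[i1:i2] into b[j1:j2].
ops = sm.get_opcodes()
for op in ops:
    print(op)
# ('insert', 0, 0, 0, 1)
# ('equal', 0, 4, 1, 5)
# ('insert', 4, 4, 5, 6)
```

The two blanks show up as inserts: they were never matched, but they are
still part of the second sequence and still count toward T.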

> and when comparing "lion" and "li on", when nothing is considered junk,
> the similarity ratio is 0!

A correct way to spell that is, e.g.,

>>> difflib.SequenceMatcher(lambda ch: False, "lion", "li on").ratio()
0.88888888888888884