[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken
Tim Peters
tim.peters at gmail.com
Mon Jul 12 05:02:48 CEST 2010
[Terry Reedy]
> I had considered the possibility of option A for 2.7 and A & C for 3.2. But
> see below.
>
> Since posting, I did an experiment with a 700 char paragraph of text (the
> summary from the post) compared to an 'edited' version. I did the
> comparison with and without the current heuristic. I did not notice much
> time difference (a couple of seconds either way) and the edit list was
> essentially the same.
When the sequences contain M (in seq1) and N (in seq2) instances of a
given element, the inner matching loop goes around M*N times
processing them. Unless the element is identified as "junk" and/or
"popular", in which case the outer loop finds the M instances (in
seq1), one at a time, and the inner loop is skipped for each (because
b2j does not contain the junk/popular element as a key, so the inner
loop ends up iterating over the empty "nothing" list).
Of course the speed difference is minor unless M*N is large. For
example, when M and N are both on the order of thousands, M*N is on
the order of millions. Such cases motivated the original "auto junk"
heuristic (large "before" and "after" text files with thousands of
repetitions of boilerplate lines - I suspect at least some of this
input was computer-generated).
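A minimal sketch of the loop shape (paraphrased, not verbatim difflib;
the helper name and the element -> positions layout of b2j are just
illustrative assumptions):

    def count_inner_iterations(a, b2j, alo, ahi):
        """Count how often the inner loop body would run for a[alo:ahi]."""
        nothing = []                          # default for elements missing from b2j
        steps = 0
        for i in range(alo, ahi):             # outer loop: positions in a
            for j in b2j.get(a[i], nothing):  # inner loop: positions of a[i] in b
                steps += 1                    # a real matcher extends a match here
        return steps

With M occurrences of an element in a[alo:ahi] and N positions recorded
for it in b2j, that's M*N trips through the inner body; remove the key
(as the junk/popular handling does) and those trips disappear.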
> The heuristic lowered the reported match ratio from .96 to .88, which
> would be bad when one wanted the unaltered value.
Except that the ratio is defined as being twice the number of matches
divided by the sum of the lengths of the sequences, so has no
interpretation independent of exactly how matching was performed.
Junk can have a very large effect on this. IOW, the ratio is not a
property of the sequences, it emerges from all the details of how
matches were found.
BTW, it's not clear whether ratio() computes a _useful_ value in the
presence of junk, however that may be defined. Perhaps it would be
better to, e.g., not use the lengths of the matches or of the input
sequences directly, but to adjust all of them by pretending junk
elements didn't exist. Or not - it's unclear.
For example, if the sequences are
['j1', 'a', 'j2']
and
['j3', 'a', 'j4']
where j1, j2, j3, and j4 are considered to be junk, what _is_ a useful
ratio? As is, it returns 0.333 (1 match, so returns 2.0 * 1 / (3 + 3)
= 1/3). But if we didn't count junk elements at all, the "correct"
ratio would be 1.0 (2.0 * 1 match, out of 2 total non-junk elements in
the input sequences).
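Running that example through SequenceMatcher as it stands (the isjunk
test below just assumes the j* placeholders are the junk):

    from difflib import SequenceMatcher

    isjunk = lambda x: x.startswith('j')
    sm = SequenceMatcher(isjunk, ['j1', 'a', 'j2'], ['j3', 'a', 'j4'])
    print(sm.ratio())    # 0.333..., i.e. 2.0 * 1 match / (3 + 3) elements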
I suspect nobody cares ;-)
> I do not know which, if any, chars other than 'e' were junked as that info
> currently is not exposed. I propose below that it should be.
An auto-junk-detection gimmick certainly should allow finding out what
it decided junk was.
> I intentionally did not list as options
>
> D. Keep the status quo that is buggy for certain uses.
>
> Good things often have more uses than the inventor intended or expected.
> They should not be prevented.
>
> E. Completely remove the heuristic, which would restore 'buggy' performance
> for other uses.
>
> One of the closed issues was E, rejected for that reason.
>
> ---
> I also did not list one of my original ideas, but after the discussion it is
> looking better to me. It is based on the following statement of the current
> heuristic:
>
> "Disregard as junk common items that occur in more that 1% of the positions
> in second sequence b, as long as b is long enough so that duplicates cannot
> be called common."
I'm not sure what "so that duplicates cannot be called common" means.
Being common shouldn't, on its own, qualify something as junk.
The essence of (a better notion of) junk probably has much more to do
with "lumpiness" in the distribution of element frequencies. In the
DNA example, I bet no one of {A, G, C, T} appears "way more"
frequently than any other, so it's crazy to call any of them "junk".
But, e.g., if we were trying to compare simple melodies, a "silence"
token would probably be very frequent compared to any individual note.
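Not something difflib computes, but a quick way to eyeball the kind of
frequency "lumpiness" I mean (the toy DNA and melody strings below are
made up):

    from collections import Counter

    dna = "ACGTTGCAACGTTACGGATC"           # four symbols, roughly uniform counts
    melody = "C-E-G---C---E-------G---"    # '-' (silence) dominates everything else

    for name, seq in (("dna", dna), ("melody", melody)):
        counts = Counter(seq)
        top, freq = counts.most_common(1)[0]
        print(name, dict(counts), "-> most common:", repr(top), freq, "of", len(seq))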
> Tim, I do not know if you remember why you chose 200 as the cutoff,
I was afraid the auto-junk heuristic would end up doing silly things
on short sequences, and 200 "isn't short" - LOL. Sheesh.
> but the above assumes that the following is not just a coincidence:
>
> (2 > 199*.01) == True
> (2 > 200*.01) == False
>
> In other words, 200 is the smallest length for b that prevents the junking
> of duplicates.
Well, the rule was also "> 1%", and floating arithmetic isn't used for
that part ... it's here:
if n >= 200 and len(indices) * 100 > n:
where n is the length of the sequence. len(indices) * 100 > n is
mathematically (not necessarily numerically) equivalent to
len(indices) > n / 100, or, IOW, the thing appears in more than 1% of
the positions. It certainly was the intent that nothing would be
called junk unless it appeared at least twice, so the "n >= 200"
clause ensures that 1% of n is at least 2.
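For context, here's a paraphrased sketch of the surrounding logic
(simplified from the 2.x __chain_b, not a verbatim copy of the stdlib
source):

    def chain_b(b):
        n = len(b)
        b2j = {}                        # element -> list of its positions in b
        popular = {}
        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    popular[elt] = 1    # occupies more than 1% of the positions
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]
        for elt in popular:             # drop "popular" keys entirely, so the
            del b2j[elt]                # matching loops never iterate over them
        return b2j, popular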
> F. Generalize the heuristic by replacing '1' with 'k', where k can be None
> (no heuristic) or 1-99. If not None, replace 200 by 200/k to minimally avoid
> junking of duplicates. If len(b) >= 200/k, then item counts should be
> greater than (len(b)*k)//100, which need only be calculated once.
Since cleverness already screwed people here, I'd rather be more
straightforward: allow specifying the percentage as a float in [0.0,
100.0], and also allow specifying a minimum count directly. Like
"call something junk if it appears in more than 0.231% of the positions
and also appears at least 42 times". That would be a 2-tuple rather
than a single number, with default (1.0, 2) to match current behavior.
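A hypothetical sketch to make that 2-tuple concrete (the is_autojunk
name and its threshold argument are invented here, not an existing
difflib API):

    def is_autojunk(count, n, threshold=(1.0, 2)):
        """Treat an element as auto-junk iff it occurs at least min_count
        times and in more than pct percent of the n positions."""
        if threshold is None:        # None would switch auto-junk off entirely
            return False
        pct, min_count = threshold
        return count >= min_count and count * 100.0 > pct * n

(The current code gets the "at least 2" part only implicitly, via the
n >= 200 floor; spelling out min_count makes it explicit.)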
However, I'm wary of introducing a generalization in the absence of
experience saying people would use it. Is this the right kind of
parametrization? Is this even the right kind of way to go about
auto-detecting junk? I know it worked great for the original use case
that drove it, but I haven't seen anyone say they want a notion of
auto-junk detection for other uses - just that they _don't_ want the
wholly inappropriate current auto-junk detection in (some or all) of
their uses.
IOW, it's hard to generalize confidently from a sample of one :-(
> Implementation: Add a new parameter named 'common' or 'threshold' or
> whatever that defaults to 1.
I'd call it "autojunk", cuz that's what it would do. Also a useful
syntactic overlap with the name of the current "isjunk" argument.
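Purely as a signature sketch (hypothetical, not a committed API), the
new name would sit right next to the old one:

    class SequenceMatcher:
        # Hypothetical signature sketch only -- not the current difflib API.
        def __init__(self, isjunk=None, a='', b='', autojunk=True):
            # autojunk=False would switch the "popular element" heuristic
            # off; the default True preserves today's behavior.
            self.isjunk = isjunk
            self.autojunk = autojunk
            # a and b would be handed to set_seqs() exactly as they are now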
> After computing b2j without the heuristic, if
> 'common' is not None, prune b2j as described above.
>
> My thinking here is that a user either knows or can easily find out the
> length of b and the size of the intended or actual alphabet of b (the latter
> is len(set(b))). So the user can conditionally call SequenceMatcher with
> 'common' set to None or an int as appropriate, perhaps after some
> experimentation. So the threshold is the only tuning parameter actually
> needed, and it also allows the heuristic to be turned off.
>
> The reason I did not list this before is the problem with 2.7.1. F, unlike
> option A, overtly changes the api, and some would object to that for 2.7
> even though it really is a bugfix.
Shouldn't be a problem for a major release (like 2.7), but would be
for a minor (bugfix) release (like 2.7.1).
> However, option F will not break code, while the covert change of
> option A could break code. So this may be the best option for a bad
> situation. It is a minimal change that gives the user complete flexibility.
Certainly an improvement, just unsure whether the flexibility provided
will prove to be useful (beyond the ability to specify None to turn
auto-junk off, which is the burning issue).
> In other words, I see three options for 2.7.1+:
> D. keep the current sometimes buggy behavior
> A. covertly change the heuristic to mostly fix it but possibly break some
> uses.
> F. add a parameter, with a no-change default, that allows the user to select
> a change if and when wanted.
>
> Another advantage of F is that then 2.7 and 3.2 would get the same change.
A very good thing - if it's the right change ;-)
> Other changes that apply regardless of the heuristic/api change:
>
> Update the code to use sets (newer than difflib) instead of dicts with
> values set to 1.
>
> Directly expose the set of 'common' items as an additional attribute of
> SequenceMatcher instances. Such instance attributes are currently
> undocumented, so adding one can hardly be a problem. Add documentation
> thereof. Being able to see the effect of the heuristic when it is not turned
> off might help people decide whether or not to use it, or how to tune the
> threshold for smallish alphabets where 1 is too small.
Wholly agreed. junkdict (after turning it into a set) should also be
exposed - when someone passes in a fancy regexp matcher for the isjunk
argument, they can be surprised at what their regexp matches. Being
able to see the results can be helpful there too, for debugging.
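For instance (the attribute names below are placeholders for whatever
the exposed sets end up being called - they don't exist today):

    from difflib import SequenceMatcher

    a = "x" * 300 + "abc"
    b = "x" * 300 + "abd"      # 'x' fills well over 1% of b's 303 positions
    sm = SequenceMatcher(lambda ch: ch == "b", a, b)
    print("auto-junked ('popular') elements:", sm.bpopular)  # e.g. {'x'}
    print("elements the isjunk callable hit:", sm.bjunk)     # e.g. {'b'}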