Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

July 13, 2010

      On 7/11/2010 11:02 PM, Tim Peters wrote:
...
...
The heuristic lowered the reported match ratio from .96 to .88, which
would be bad when one wanted the unaltered value.
...
BTW, it's not clear whether ratio() computes a _useful_ value in the
presence of junk, however that may be defined.
I agree, which is one reason why one should be to disable auto-junking.

There are a number of statistical methods for analyzing similarity 
matrices, analogous to correlation matrices, except that entries are in 
[0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such 
analyses for sets of species. Similarity measures should ususally be 
symmetric and increase with greater matching. The heuristic can cause 
both to fail.
...
I suspect nobody cares ;-)
There are multiple possible definitions of similarity for sets (and 
arguments thereabout). I am sure the same is true for sequences. But I 
consider the definition for .ratio, without the heuristic, to be 
sensible. I would consider using it should the occasion arise.
...
It certainly was the intent that nothing would be
called junk unless it appeared at least twice, so the "n>= 200"
clause ensures that 1% of n is at least 2.
Since 2 cannot be greater than something that is at least 2, you ensured 
that nothing would be called junk unless it appears as least thrice.
...
However, I'm wary of introducing a generalization in the absence of
experience saying people would use it.  Is this the right kind of
parametrization?  Is this even the right kind of way to go about
auto-detecting junk?  I know it worked great for the original use case
that drove it, but I haven't seen anyone say they want a notion of
auto-junk detection for other uses - just that they _don't_ want the
wholly inappropriate current auto-junk detection in (some or all) of
their uses.
IOW, it's hard to generalize confidently from a sample of one :-(
...
Implementation: Add a new parameter named 'common' or 'threshold' or
whatever that defaults to 1.
I'd call it "autojunk", cuz that's what it would do.  Also a useful
syntactic overlap with the name of the current "isjunk" argument.
I like that. I am now leaning toward the following?

G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 
'autojunk = True' to the constructor signature. This is the minimal 
change that fixes the bug of no choice while keeping the default as is. 
So it is a minimal violation of the usual stricture against API changes 
in bugfix releases. I would doc this as "Use an internal heuristic that 
identifies 'common' items as junk." and separately describe the 'current 
heuristic', leaving open the possibility of changing it.

Possible suboption: enforce 'autojunk in (True,False)' so the user 
cannot forget that it is a switch and not a tuning parameter.

In 3.2, expose as an attribute a tuple 'hueristic' or '_heuristic' with 
the tuning parameters for the current heuristic. Adding the _ would 
indicate that is it a private, expert-only, use at your own risk, 
subject to change attribute.

If we agree on this much, we can then discuss what the tuple should be 
for 3.2.
...
...
Other changes that apply regardless of the heuristic/api change:
Update the code to use sets (newer than difflib) instead of dicts with
values set to 1.
...
...
Directly expose the set of 'common' items as an additional attribute of
SequenceMatcher instances. Such instance attributes are currently
undocumented, so adding one can hardly be a problem. Add documention
thereof. Being able to see the effect of the heuristic when it is not turned
off might help people decide whether or not to use it, or how to tune the
threshold for smallish alphabets where 1 is too small.
Wholly agreed.  junkdict (after turning it into a set) should also be
exposed - when someone passes in a fancy regexp matcher for the isjunk
argument, they can be surprised at what their regexp matches.  Being
able to see the results can be helpful there too, for debugging.
I meant to include junkdict also.

-- 
Terry Jan Reedy

Re: [Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

Terry Reedy