On 7/11/2010 11:02 PM, Tim Peters wrote:
The heuristic lowered the reported match ratio from .96 to .88, which would be bad when one wanted the unaltered value.
BTW, it's not clear whether ratio() computes a _useful_ value in the presence of junk, however that may be defined.
I agree, which is one reason why one should be to disable auto-junking. There are a number of statistical methods for analyzing similarity matrices, analogous to correlation matrices, except that entries are in [0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such analyses for sets of species. Similarity measures should ususally be symmetric and increase with greater matching. The heuristic can cause both to fail.
I suspect nobody cares ;-)
There are multiple possible definitions of similarity for sets (and arguments thereabout). I am sure the same is true for sequences. But I consider the definition for .ratio, without the heuristic, to be sensible. I would consider using it should the occasion arise.
It certainly was the intent that nothing would be called junk unless it appeared at least twice, so the "n>= 200" clause ensures that 1% of n is at least 2.
Since 2 cannot be greater than something that is at least 2, you ensured that nothing would be called junk unless it appears as least thrice.
However, I'm wary of introducing a generalization in the absence of experience saying people would use it. Is this the right kind of parametrization? Is this even the right kind of way to go about auto-detecting junk? I know it worked great for the original use case that drove it, but I haven't seen anyone say they want a notion of auto-junk detection for other uses - just that they _don't_ want the wholly inappropriate current auto-junk detection in (some or all) of their uses.
IOW, it's hard to generalize confidently from a sample of one :-(
Implementation: Add a new parameter named 'common' or 'threshold' or whatever that defaults to 1.
I'd call it "autojunk", cuz that's what it would do. Also a useful syntactic overlap with the name of the current "isjunk" argument.
I like that. I am now leaning toward the following? G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 'autojunk = True' to the constructor signature. This is the minimal change that fixes the bug of no choice while keeping the default as is. So it is a minimal violation of the usual stricture against API changes in bugfix releases. I would doc this as "Use an internal heuristic that identifies 'common' items as junk." and separately describe the 'current heuristic', leaving open the possibility of changing it. Possible suboption: enforce 'autojunk in (True,False)' so the user cannot forget that it is a switch and not a tuning parameter. In 3.2, expose as an attribute a tuple 'hueristic' or '_heuristic' with the tuning parameters for the current heuristic. Adding the _ would indicate that is it a private, expert-only, use at your own risk, subject to change attribute. If we agree on this much, we can then discuss what the tuple should be for 3.2.
Other changes that apply regardless of the heuristic/api change:
Update the code to use sets (newer than difflib) instead of dicts with values set to 1.
Directly expose the set of 'common' items as an additional attribute of SequenceMatcher instances. Such instance attributes are currently undocumented, so adding one can hardly be a problem. Add documention thereof. Being able to see the effect of the heuristic when it is not turned off might help people decide whether or not to use it, or how to tune the threshold for smallish alphabets where 1 is too small.
Wholly agreed. junkdict (after turning it into a set) should also be exposed - when someone passes in a fancy regexp matcher for the isjunk argument, they can be surprised at what their regexp matches. Being able to see the results can be helpful there too, for debugging.
I meant to include junkdict also. -- Terry Jan Reedy