Issue 2986: difflib.SequenceMatcher is partly broken
[Also posted to http://bugs.python.org/issue2986 Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.]

Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type". An "experimental" heuristic was added in 2.3a1 to speed up its application to sequences of code lines, which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet. The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made.

In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed duplicates of #2986. The heuristic needs revision.

Open questions (discussed after the examples): what exactly to do, which versions to do it to, and who will do it.

---
Some minimal difference examples:

    from difflib import SequenceMatcher as SM

    # base example
    print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
    # should be and is 0.9975 (rounded)

    # make 'y' junk
    print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
    # should be and is 0.0

    # Increment b by 1 char
    print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
    # should be .995, but now is 0.0 because y is treated as junk

    # Reverse a and b, which increments b
    print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
    # should be .9975, as before, but now is 0.0 because y is junked

The reason for the bug is the heuristic: if the second sequence is at least 200 items long, then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.

A more realistic example than the above is comparing DNA gene sequences. Without the heuristic, SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits and .ratio works as documented and expected. For 1000/2000/6000 bases, the times on an old Athlon 2800 machine are <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic and practical use.

With the heuristic, everything is junk and there is only one match, ''=='' augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second sequence. A much faster way to find the first mismatch would be

    i = 0
    while first[i] == second[i]:
        i += 1

The match ratio, based on the initial matching prefix only, is spuriously low.

---
Questions:

1: What change should be made.

Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward.

Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 items (when len(b2j) > 100, where b2j is first calculated without popularity deletions). This would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 different lines, and so the heuristic should continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision.

Version B: Add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it.

Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.

2. What type of issue is this, and which versions get changed.

I see the proposal as a partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix. The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.

3. Who will make the change.

Eli will write a patch and I will check it. However, Georg Brandl assigned the issue to Tim Peters, with a request for comment, but Tim never responded. Is there an active committer who will grab the issue and do a commit review when a patch is ready?

-- Terry Jan Reedy
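For concreteness, a rough sketch of the proposed disentangling (illustrative only, not actual difflib code; the 'autojunk' switch name is hypothetical and corresponds to Version B, while pruning after the full dict is built is the "apply the heuristic afterward" idea):

    def build_b2j(b, autojunk=True):
        # First build the complete item -> list-of-indices map, with no
        # deletions at all.
        b2j = {}
        for i, elt in enumerate(b):
            b2j.setdefault(elt, []).append(i)
        # Only afterward, optionally apply the current heuristic: for
        # sequences of 200+ items, drop items occupying more than 1% of
        # the positions.
        n = len(b)
        if autojunk and n >= 200:
            popular = [elt for elt, idxs in b2j.items() if len(idxs) * 100 > n]
            for elt in popular:
                del b2j[elt]
        return b2j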
On Tue, Jul 6, 2010 at 7:18 PM, Terry Reedy <tjreedy@udel.edu> wrote:
[Also posted to http://bugs.python.org/issue2986

A much faster way to find the first mismatch would be

    i = 0
    while first[i] == second[i]:
        i += 1

The match ratio, based on the initial matching prefix only, is spuriously low.
I don't have much experience with the Python sequence matcher, but many classical edit distance and alignment algorithms benefit from stripping any common prefix and suffix before engaging in heavy-lifting. This is trivially optimal for Hamming-like distances and easily shown to be for Levenshtein and Damerau type distances. -Kevin
[Terry Reedy]
[Also posted to http://bugs.python.org/issue2986 Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.]
Thanks for paying attention to this, Terry (and Eli)! I somehow managed to miss the whole discussion over the intervening years :-(
Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type".
Although it can be used for that, its true intent was to produce intuitive diffs for revisions of text files (including code files) edited by humans. Where "intuitive" means approximately "less jarring than the often-odd diffs produced by algorithms working on some rigid mathematical notion of 'minimal edit distance'". Whether it's useful for more than that I can't say, because that's all I ever developed (or used) the algorithm for.
An "experimental" heuristic was added in 2.3a1
Big bad on me for that! At the time I fully intended to document that, and at least make it tunable, but life intervened and it got dropped on the floor.
to speed up its application to sequences of code lines,
Yes, that was the intent. I was corresponding with a user at the time who had odd notions (well, by my standards) of how to format C code, which left him with many hundreds of lines containing only an open brace, or a close brace, or just a semicolon (etc). difflib spun its wheels frantically trying to sort this out, and the heuristic in question cut processing time from hours (in the worst cases) to seconds. Since that (text file comparison) was always the primary case for this class, it was worth doing something about. But it should not have gone in the way it did (undocumented & unfinished, as you correctly note).
which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet.
Which wasn't an anticipated use case, so should not be favored. Slowing down difflib for what it was intended for is not a good idea - practicality beats purity. Ya, ya, I understand someone playing around with DNA sequences might find difflib tempting at first, but fix this and they're still going to be unhappy. There are much better (faster, "more standard") algorithms for comparing sequences drawn from tiny alphabets, and especially so for DNA matching.
The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made.
True - and always was.
In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed duplicates of #2986. The heuristic needs revision.
Open questions (discussed after the examples): what exactly to do, which versions to do it to, and who will do it.
--- Some minimal difference examples:
from difflib import SequenceMatcher as SM
# base example
print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.9975 (rounded)

# make 'y' junk
print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.0

# Increment b by 1 char
print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
# should be .995, but now is 0.0 because y is treated as junk

# Reverse a and b, which increments b
print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
# should be .9975, as before, but now is 0.0 because y is junked
The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.
Indeed, it makes no sense at all for tiny alphabets. OTOH, as above, it gave factor-of-thousands speedups for intended use cases, and that's more important to me. There should certainly be a way to turn off the "auto junk" heuristic, and to tune it, but - sorry for being pragmatic ;-) - it was a valuable speed improvement for what I expect still remain difflib's overwhelmingly most common use cases.
A more realistic example than the above is comparing DNA gene sequences.
Comparing DNA sequences is realistic, but using SequenceMatcher to do so is unrealistic except for a beginner just playing with the ideas. There should be a way to disable the heuristic so the beginner can have their fun, but any serious work in this area will need to use different algorithms.
Without the heuristic, SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits and .ratio works as documented and expected. For 1000/2000/6000 bases, the times on an old Athlon 2800 machine are <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic and practical use.
Did I just call you a beginner just playing with the ideas? Could be ;-) People doing serious work in this area employ a great many other approaches specifically designed for DNA matching: http://en.wikipedia.org/wiki/Sequence_alignment and have developed dozens of software packages to implement them: http://en.wikipedia.org/wiki/Sequence_alignment_software
With the heuristic, everything is junk and there is only one match, ''=='' augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second sequence. A much faster way to find the first mismatch would be

    i = 0
    while first[i] == second[i]:
        i += 1

The match ratio, based on the initial matching prefix only, is spuriously low.
The details don't really matter to the conclusion: by definition, "junk" is ignored for purposes of finding a match. Therefore any method of automatically calling something "junk" is going to screw up _some_ use case. I agree that's unacceptable - there must be a way to turn that off. But, again, it proved valuable for SequenceMatcher's _primary_ use case, so keeping it enabled by default is the most sensible thing to do.
--- Questions:
1: What change should be made.
Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward.
Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 items (when len(b2j)> 100 where b2j is first calculated without popularity deletions).
This would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 different lines, and so the heuristic should continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision.
It's a better heuristic, in the sense that it sidesteps a fatal problem in the "tiny alphabet" use case. I'm sure it will leave fatal problems for other (unanticipated) use cases, though. So, an improvement, but not "a solution".
Version B: add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it.
Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.
I don't see that they're mutually exclusive. I expect that the best design would have both a better default heuristic _and_ a way to tune or disable it entirely. A & C above.
2. What type of issue is this, and which versions get changed.
I see the proposal as partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix.
It's definitely a bug in the sense that it broke some previously working code. But pull the patch, and the user I corresponded with at the start will see code comparison times leap back from seconds to hours, which effectively makes difflib useless for him, and so breaks _his_ real application (he used difflib in a commercial text editor, BTW). Both are unacceptable - but that's how things are sometimes. I favor the primary use case (text file comparison).
The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.
I'm afraid there's no true fix here without an API change (any heuristic for auto-identifying junk will break _some_ use case). Your #A sounds good to me for 2.7, and a combination of #A and #C sounds best for 3.x.
3. Who will make the change.
Eli will write a patch and I will check it. However, Georg Brandl assigned the issue to Tim Peters, with a request for comment, but Tim never responded.
I see he asked in March of 2009 - he should have picked a month I was paying attention - or at least a year ;-)
Is there an active committer who will grab the issue and do a commit review when a patch is ready?
If Guido will lend me his time machine, I'll go back to 2002 and apply the patch then, when it should have been done :-(
[snip]
Yes, that was the intent. I was corresponding with a user at the time who had odd notions (well, by my standards) of how to format C code, which left him with many hundreds of lines containing only an open brace, or a close brace, or just a semicolon (etc). difflib spun its wheels frantically trying to sort this out, and the heuristic in question cut processing time from hours (in the worst cases) to seconds.
Since that (text file comparison) was always the primary case for this class, it was worth doing something about. But it should not have gone in the way it did (undocumented & unfinished, as you correctly note).
which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet.
Which wasn't an anticipated use case, so should not be favored. Slowing down difflib for what it was intended for is not a good idea - practicality beats purity.
Ya, ya, I understand someone playing around with DNA sequences might find difflib tempting at first, but fix this and they're still going to be unhappy. There are much better (faster, "more standard") algorithms for comparing sequences drawn from tiny alphabets, and especially so for DNA matching.
Tim, thanks for your insights. In response to the description above, however, I would like to explain my use case, which originally got me interested in this issue with SequenceMatcher.

I was not comparing DNA, but using SequenceMatcher in my automatic testbench checker that verified the output of some logic design. I didn't want exact comparisons, so I was very happy to see difflib.SequenceMatcher in the stdlib, with its useful ratio/quick_ratio functions. I was comparing the output sequence to an expected sequence with a 0.995 ratio threshold and was very happy. Until my sequence got longer than 200 elements...

So this isn't DNA, and the alphabet wasn't too tiny, but on the other hand there was nothing in the module to suggest that it should only be used for comparing lines in files. On the contrary, its general-sounding name, SequenceMatcher, lulled me into the (false?) belief that I could just use it for my sequence comparison without worrying about finding better algorithms or implementing stuff like edit distance myself. Judging by the comments in other related issues, I'm far from being the only one.

Therefore, I think that you should just admit that your excellent module became useful for more purposes than you originally intended it for :-) !! I completely respect your desire to keep the "intended purposes" as fast as possible, but there are solutions (some of which were presented by Terry) that can make it more useful without any harm to the performance of the intended purpose.

As Terry noted, I will be very happy to submit a patch with tests for whatever decision is reached by pydev on this matter.

Eli
On Tue, 06 Jul 2010 19:18:09 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 items (when len(b2j)> 100 where b2j is first calculated without popularity deletions).
[...]
Version B: add a parameter to .__init__ to make the heuristic optional.
[...]
Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.
Version B would have my favour (but please make the default be True). Version A can lead to regressions (including performance regressions such as described by Tim), and version C looks far more complicated to use. Regards Antoine.
On Wed, Jul 7, 2010 at 9:18 AM, Terry Reedy <tjreedy@udel.edu> wrote:
In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental. Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed duplicates of #2986. The heuristic needs revision.
Python 2.3 you say... Hmm, I've been using difflib.SequenceMatcher for years in a serial bit error rate tester (with typical message sizes ranging from tens of bytes to tens of thousands of bytes) that occasionally gives unexpected results. I'd been blaming hardware glitches (and, to be fair, all of the odd results I can recall off the top of my head were definitively traced to problems in the hardware under test), but I should probably check I'm not running afoul of this bug.

And Tim, the algorithm may not be optimal as a general purpose binary diff algorithm, but it's still a hell of a lot faster than the hardware I use it to test. Compared to the equipment configuration times, the data comparison time is trivial.

There's another possibility here - perhaps the heuristic should be off by default in SequenceMatcher, with a TextMatcher subclass that enables it (and Differ and HtmlDiff then inheriting from the latter)?

There's currently barely anything in the SequenceMatcher documentation to indicate that it is designed primarily for comparing text rather than arbitrary sequences (the closest it gets is the reference to Ratcliff/Obershelp gestalt pattern matching and then the link to the Ratcliff/Metzener Dr Dobb's article - and until this thread, I'd never followed the article link). Rather than reverting to Tim's undocumented vision, perhaps we should better articulate it by separating the general purpose matcher from an optimised text matcher.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
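For what it's worth, a minimal sketch of that split, assuming the base class grows some switch for the heuristic (the 'autojunk' name here is hypothetical, not an existing parameter in the version under discussion):

    from difflib import SequenceMatcher

    class TextMatcher(SequenceMatcher):
        """Line/text-oriented matcher: keeps the popular-item heuristic on."""
        def __init__(self, isjunk=None, a='', b=''):
            SequenceMatcher.__init__(self, isjunk, a, b, autojunk=True)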
[Nick Coghlan]
... Hmm, I've been using difflib.SequenceMatcher for years in a serial bit error rate tester (with typical message sizes ranging from tens of bytes to tens of thousands of bytes) that occasionally gives unexpected results. I'd been blaming hardware glitches (and, to be fair, all of the odd results I can recall off the top of my head were definitively traced to problems in the hardware under test), but I should probably check I'm not running afoul of this bug.
That would be prudent ;-)
And Tim, the algorithm may not be optimal as a general purpose binary diff algorithm, but it's still a hell of a lot faster than the hardware I use it to test. Compared to the equipment configuration times, the data comparison time is trivial.
I'm all in favor of people using the class for any purpose they find useful. Just saying the overwhelmingly most common use is for comparing text files, and that's important - most important, since most widely used.
There's another possibility here - perhaps the heuristic should be off by default in SequenceMatcher, with a TextMatcher subclass that enables it (and Differ and HtmlDiff then inheriting from the latter)?
Or, to make life easiest for the most common uses, create a subclass that _didn't_ have any notion of "junk" whatsoever. Or a new flag to turn auto-junk heuristics off. Unfortunately, no true solution exists without changing _something_ in the API, and since the behavior changed 8 years ago there's just no guessing how many uses rely on the by-now-long-current behavior.
There's currently barely anything in the SequenceMatcher documentation to indicate that it is designed primarily for comparing text rather than arbitrary sequences (the closest it gets is the reference to Ratcliff/Obershelp gestalt pattern matching and then the link to the Ratcliff/Metzener Dr Dobb's article - and until this thread, I'd never followed the article link).
It's designed to compare sequences of hashable elements. The use cases that _drove_ the implementation were (1) viewing a file as a sequence of lines; and (2) viewing a line as a sequence of characters. Use cases always drive implementation (although rarely mentioned in the docs), but striving for natural generalization sometimes pays off. I expected it would in this case, and that others have found unanticipated uses confirms that it did. Unfortunately, I screwed up by not finishing what I started 8 years ago (adding an auto-junk heuristic that was very valuable in the primary use case, but turned out to have very bad effects in some other cases).
Rather than reverting to Tim's undocumented vision, perhaps we should better articulate it by separating the general purpose matcher from an optimised text matcher.
Having a notion of junk can improve the quality of results (in the sense of making them more intuitive to humans, which was an explicit goal of the module), and can yield enormous speedups. Is this restricted solely to comparing text files? I don't know that to be the case, but offhand doubt it's true. As always, if they exist, we won't hear from people with other use cases who never noticed the change (except perhaps to see improved speed and better results) - not until we "turn off" the heuristic on them, and then they'll view _that_ as "a bug".
I had considered the possibility of option A for 2.7 and A & C for 3.2. But see below.

Since posting, I did an experiment with a 700 char paragraph of text (the summary from the post) compared to an 'edited' version. I did the comparison with and without the current heuristic. I did not notice much time difference (a couple of seconds either way) and the edit list was essentially the same. The heuristic lowered the reported match ratio from .96 to .88, which would be bad when one wanted the unaltered value. I do not know which, if any, chars other than 'e' were junked, as that info currently is not exposed. I propose below that it should be.

I intentionally did not list as options:

D. Keep the status quo that is buggy for certain uses. Good things often have more uses than the inventor intended or expected. They should not be prevented.

E. Completely remove the heuristic, which would restore 'buggy' performance for other uses. One of the closed issues was E, rejected for that reason.

---
I also did not list one of my original ideas, but after the discussion it is looking better to me. It is based on the following statement of the current heuristic:

"Disregard as junk common items that occur in more than 1% of the positions in second sequence b, as long as b is long enough so that duplicates cannot be called common."

Tim, I do not know if you remember why you chose 200 as the cutoff, but the above assumes that the following is not just a coincidence:

    (2 > 199*.01) == True
    (2 > 200*.01) == False

In other words, 200 is the smallest length for b that prevents the junking of duplicates.

F. Generalize the heuristic by replacing '1' with 'k', where k can be None (no heuristic) or 1-99. If not None, replace 200 by 200/k to minimally avoid junking of duplicates. If len(b) >= 200/k, then item counts should be greater than (len(b)*k)//100, which need only be calculated once.

Implementation: Add a new parameter named 'common' or 'threshold' or whatever that defaults to 1. After computing b2j without the heuristic, if 'common' is not None, prune b2j as described above.

My thinking here is that a user either knows or can easily find out the length of b and the size of the intended or actual alphabet of b (the latter is len(set(b))). So the user can conditionally call SequenceMatcher with 'common' set to None or an int as appropriate, perhaps after some experimentation. So the threshold is the only tuning parameter actually needed, and it also allows the heuristic to be turned off.

The reason I did not list this before is the problem with 2.7.1. F, unlike option A, overtly changes the api, and some would object to that for 2.7 even though it really is a bugfix. However, option F will not break code, while the covert change of option A could break code. So this may be the best option for a bad situation. It is a minimal change that gives the user complete flexibility.

In other words, I see three options for 2.7.1+:

D. Keep the current sometimes buggy behavior.
A. Covertly change the heuristic to mostly fix it but possibly break some uses.
F. Add a parameter, with a no-change default, that allows the user to select a change if and when wanted.

Another advantage of F is that then 2.7 and 3.2 would get the same change.

--
Other changes that apply regardless of the heuristic/api change:

Update the code to use sets (newer than difflib) instead of dicts with values set to 1.

Directly expose the set of 'common' items as an additional attribute of SequenceMatcher instances. Such instance attributes are currently undocumented, so adding one can hardly be a problem. Add documentation thereof. Being able to see the effect of the heuristic when it is not turned off might help people decide whether or not to use it, or how to tune the threshold for smallish alphabets where 1 is too small.

-- Terry Jan Reedy
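A rough sketch of what option F's pruning step might look like (function and parameter names hypothetical, semantics as described above):

    def prune_common(b2j, n, common=1):
        # 'common' is the percentage threshold k (1-99), or None to turn
        # the heuristic off entirely.
        if common is None or n < 200 / common:
            return b2j
        cutoff = (n * common) // 100
        # Drop items whose count exceeds the threshold; keep the rest.
        return {elt: idxs for elt, idxs in b2j.items() if len(idxs) <= cutoff}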
On Wed, 07 Jul 2010 21:04:17 -0400 Terry Reedy <tjreedy@udel.edu> wrote:
In other words, I see three options for 2.7.1+:
[...] I don't think 2.7 should get any change at all here. Only 3.2 should be modified. As Tim said, difflib works ok for its intended use (regular text diffs). Making it work for other uses is a new feature, not a bugfix. Regards Antoine.
[Antoine Pitrou]
I don't think 2.7 should get any change at all here. Only 3.2 should be modified. As Tim said, difflib works ok for its intended use (regular text diffs).
That was the use case that drove the implementation, but it's going too far to say that was the only "intended" case. I believe (but can't prove) that remains the most common use (& overwhelmingly so), but it was indeed _intended_ to work for any sequences of hashable elements. And it always did, and it still does, in the sense that it computes a diff that transforms the first sequence into the second sequence. The problem is that I introduced a heuristic speedup with the primary use case in mind that turned out to vastly damage the _quality_ of the results for some other uses (a correct diff isn't necessarily a useful diff - for example, "delete the entire sequence you started with, then insert the entire new sequence" is a correct diff for any pair of input sequences, but not a useful diff for most purposes).
Making it work for other uses is a new feature, not a bugfix.
Definitely not a new feature. These other cases used to deliver much better diffs, before I introduced the heuristic in question. People with these other cases are asking for a way to get the results they used to get - and we know that's so because a few figured out they get what they want just by (in effect) reverting the checkin (made about 8 years ago) that _introduced_ the heuristic. So they're looking for a way to restore older behavior, not to introduce new behavior. Of course this is obscured by that the change happened so long ago that I bet most of them don't know at first that it _was_ the old behavior.
[Terry Reedy]
I had considered the possibility of option A for 2.7 and A & C for 3.2. But see below.
Since posting, I did an experiment with a 700 char paragraph of text (the summary from the post) compared to an 'edited' version. I did the comparison with and without the current heuristic. I did not notice much time difference (a couple of seconds either way) and the edit list was essentially the same.
When the sequences contain M (in seq1) and N (in seq2) instances of a given element, the inner matching loop goes around M*N times processing them. Unless the element is identified as "junk" and/or "popular", in which case the outer loop finds the M instances (in seq1), one at a time, and the inner loop is skipped for each (because b2j does not contain the junk/popular element as a key, so the inner loop ends up iterating over the empty "nothing" list). Of course the speed difference is minor unless M*N is large. For example, when M and N are both on the order of thousands, M*N is on the order of millions. Such cases motivated the original "auto junk" heuristic (large "before" and "after" text files with thousands of repetitions of boilerplate lines - I suspect at least some of this input was computer-generated).
The heuristic lowered the reported match ratio from .96 to .88, which would be bad when one wanted the unaltered value.
Except that the ratio is defined as being twice the number of matches divided by the sum of the lengths of the sequences, so has no interpretation independent of exactly how matching was performed. Junk can have a very large effect on this. IOW, the ratio is not a property of the sequences, it emerges from all the details of how matches were found. BTW, it's not clear whether ratio() computes a _useful_ value in the presence of junk, however that may be defined. Perhaps it would be better to, e.g., not use the lengths of the matches or of the input sequences directly, but to adjust all of them by pretending junk elements didn't exist. Or not - it's unclear. For example, if the sequences are ['j1', 'a', 'j2'] and ['j3', 'a', 'j4'] where j1, j2, j3, and j4 are considered to be junk, what _is_ a useful ratio? As is, it returns 0.333 (1 match, so returns 2.0 * 1 / (3 + 3) = 1/3). But if we didn't count junk elements at all, the "correct" ratio would be 1.0 (2.0 * 1 match, out of 2 total non-junk elements in the input sequences). I suspect nobody cares ;-)
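The arithmetic in that example is easy to check with a throwaway snippet (isjunk here just flags the j* placeholders):

    from difflib import SequenceMatcher

    isjunk = lambda x: x in {'j1', 'j2', 'j3', 'j4'}
    sm = SequenceMatcher(isjunk, ['j1', 'a', 'j2'], ['j3', 'a', 'j4'])
    print(sm.ratio())   # one match ('a'): 2.0 * 1 / (3 + 3) = 0.333...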
I do not know which, if any, chars other than 'e' were junked as that info currently is not exposed. I propose below that it should be.
An auto-junk-detection gimmick certainly should allow finding out what it decided junk was.
I intentionally did not list as options
D. Keep the status quo that is buggy for certain uses.
Good things often have more uses than the inventor intended or expected. They should not be prevented.
E. Completely remove the heuristic, which would restore 'buggy' performance for other uses.
One of the closed issues was E, rejected for that reason.
--- I also did not list one of my original ideas, but after the discussion it is looking better to me. It is based on the following statement of the current heuristic:
"Disregard as junk common items that occur in more that 1% of the positions in second sequence b, as long as b is long enough so that duplicates cannot be called common."
I'm not sure what "so that duplicates cannot be called common" means. Being common shouldn't, on its own, disqualify something from being junk. The essence of (a better notion of) junk probably has much more to do with "lumpiness" in the distribution of element frequencies. In the DNA example, I bet no one of {A, G, C, T} appears "way more" frequently than any other, so it's crazy to call any of them "junk". But, e.g., if we were trying to compare simple melodies, a "silence" token would probably be very frequent compared to any individual note.
Tim, I do not know if you remember why you chose 200 as the cutoff,
I was afraid the auto-junk heuristic would end up doing silly things on short sequences, and 200 "isn't short" - LOL. Sheesh.
but the above assumes that the following is not just a coincidence:
    (2 > 199*.01) == True
    (2 > 200*.01) == False
In other words, 200 is the smallest length for b that prevents the junking of duplicates.
Well, the rule was also "> 1%", and floating arithmetic isn't used for that part ... it's here:

    if n >= 200 and len(indices) * 100 > n:

where n is the length of the sequence. len(indices) * 100 > n is mathematically (not necessarily numerically) equivalent to len(indices) > n / 100, or, IOW, the thing appears in more than 1% of the positions.

It certainly was the intent that nothing would be called junk unless it appeared at least twice, so the "n >= 200" clause ensures that 1% of n is at least 2.
F. Generalize the heuristic by replacing '1' with 'k', where k can be None (no heuristic) or 1-99. If not None, replace 200 by 200/k to minimally avoid junking of duplicates. If len(b) >= 200/k, then item counts should be greater than (len(b)*k)//100, which need only be calculated once.
Since cleverness already screwed people here, I'd rather be more straightforward: allow to specify the percentage as a float in [0.0, 100.0], and also allow to specify a minimum count directly. Like "call something junk if it appears in more than 0.231 of the positions and also appears at least 42 times". That would be a 2-tuple rather than a single number, with default (1.0, 2) to match current behavior.

However, I'm wary of introducing a generalization in the absence of experience saying people would use it. Is this the right kind of parametrization? Is this even the right kind of way to go about auto-detecting junk? I know it worked great for the original use case that drove it, but I haven't seen anyone say they want a notion of auto-junk detection for other uses - just that they _don't_ want the wholly inappropriate current auto-junk detection in (some or all) of their uses. IOW, it's hard to generalize confidently from a sample of one :-(
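A tiny sketch of that parametrization, just to make it concrete (names and packaging purely illustrative):

    def is_popular(count, n, autojunk=(1.0, 2)):
        # autojunk is (percentage, minimum count), or None to disable the
        # heuristic; an item is junked only if it clears both bars.
        if autojunk is None:
            return False
        pct, min_count = autojunk
        return count >= min_count and count * 100.0 > n * pct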
Implementation: Add a new parameter named 'common' or 'threshold' or whatever that defaults to 1.
I'd call it "autojunk", cuz that's what it would do. Also a useful syntactic overlap with the name of the current "isjunk" argument.
After computing b2j without the heuristic, if 'common' is not None, prune b2j as described above.
My thinking here is that a user either knows or can easily find out the length of b and the size of the intended or actual alphabet of b (the latter is len(set(b))). So the user can conditionally call SequenceMatcher with 'common' set to None or an int as appropriate, perhaps after some experimentation. So the threshold is the only tuning parameter actually needed, and it also allows the heuristic to be turned off.
The reason I did not list this before is the problem with 2.7.1. F, unlike option A, overtly changes the api, and some would object to that for 2.7 even though it really is a bugfix.
Shouldn't be a problem for a major release (like 2.7), but would be for a minor (bugfix) release (like 2.7.1).
However, option F will not break code, while the covert change of option A could break code. So this may be the best option for a bad situation. It is a minimal change that gives the user complete flexibility.
Certainly an improvement, just unsure whether the flexibility provided will prove to be useful (beyond the ability to specify None to turn auto-junk off, which is the burning issue).
In other words, I see three options for 2.7.1+:

D. Keep the current sometimes buggy behavior.
A. Covertly change the heuristic to mostly fix it but possibly break some uses.
F. Add a parameter, with a no-change default, that allows the user to select a change if and when wanted.
Another advantage of F is that then 2.7 and 3.2 would get the same change.
A very good thing - if it's the right change ;-)
Other changes that apply regardless of the heuristic/api change:
Update the code to use sets (newer than difflib) instead of dicts with values set to 1.
Directly expose the set of 'common' items as an additional attribute of SequenceMatcher instances. Such instance attributes are currently undocumented, so adding one can hardly be a problem. Add documentation thereof. Being able to see the effect of the heuristic when it is not turned off might help people decide whether or not to use it, or how to tune the threshold for smallish alphabets where 1 is too small.
Wholly agreed. junkdict (after turning it into a set) should also be exposed - when someone passes in a fancy regexp matcher for the isjunk argument, they can be surprised at what their regexp matches. Being able to see the results can be helpful there too, for debugging.
On 7/11/2010 11:02 PM, Tim Peters wrote:
The heuristic lowered the reported match ratio from .96 to .88, which would be bad when one wanted the unaltered value.
BTW, it's not clear whether ratio() computes a _useful_ value in the presence of junk, however that may be defined.
I agree, which is one reason why one should be to disable auto-junking. There are a number of statistical methods for analyzing similarity matrices, analogous to correlation matrices, except that entries are in [0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such analyses for sets of species. Similarity measures should ususally be symmetric and increase with greater matching. The heuristic can cause both to fail.
I suspect nobody cares ;-)
There are multiple possible definitions of similarity for sets (and arguments thereabout). I am sure the same is true for sequences. But I consider the definition for .ratio, without the heuristic, to be sensible. I would consider using it should the occasion arise.
It certainly was the intent that nothing would be called junk unless it appeared at least twice, so the "n>= 200" clause ensures that 1% of n is at least 2.
Since 2 cannot be greater than something that is at least 2, you ensured that nothing would be called junk unless it appears at least thrice.
However, I'm wary of introducing a generalization in the absence of experience saying people would use it. Is this the right kind of parametrization? Is this even the right kind of way to go about auto-detecting junk? I know it worked great for the original use case that drove it, but I haven't seen anyone say they want a notion of auto-junk detection for other uses - just that they _don't_ want the wholly inappropriate current auto-junk detection in (some or all) of their uses.
IOW, it's hard to generalize confidently from a sample of one :-(
Implementation: Add a new parameter named 'common' or 'threshold' or whatever that defaults to 1.
I'd call it "autojunk", cuz that's what it would do. Also a useful syntactic overlap with the name of the current "isjunk" argument.
I like that. I am now leaning toward the following:

G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 'autojunk = True' to the constructor signature. This is the minimal change that fixes the bug of no choice while keeping the default as is. So it is a minimal violation of the usual stricture against API changes in bugfix releases.

I would doc this as "Use an internal heuristic that identifies 'common' items as junk." and separately describe the 'current heuristic', leaving open the possibility of changing it.

Possible suboption: enforce 'autojunk in (True, False)' so the user cannot forget that it is a switch and not a tuning parameter.

In 3.2, expose as an attribute a tuple 'heuristic' or '_heuristic' with the tuning parameters for the current heuristic. Adding the _ would indicate that it is a private, expert-only, use at your own risk, subject to change attribute.

If we agree on this much, we can then discuss what the tuple should be for 3.2.
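For illustration, usage under the proposed signature would look like this (expected ratios taken from the examples at the top of the thread; the parameter name is the one proposed above, not yet an existing API at this point in the discussion):

    from difflib import SequenceMatcher as SM

    # default: heuristic on, current behavior ('y' is auto-junked)
    SM(None, 'x' + 'y'*199, 'y'*200).ratio()                  # 0.0
    # proposed switch: heuristic off, the expected ratio is recovered
    SM(None, 'x' + 'y'*199, 'y'*200, autojunk=False).ratio()  # 0.995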
Other changes that apply regardless of the heuristic/api change:
Update the code to use sets (newer than difflib) instead of dicts with values set to 1.
Directly expose the set of 'common' items as an additional attribute of SequenceMatcher instances. Such instance attributes are currently undocumented, so adding one can hardly be a problem. Add documentation thereof. Being able to see the effect of the heuristic when it is not turned off might help people decide whether or not to use it, or how to tune the threshold for smallish alphabets where 1 is too small.
Wholly agreed. junkdict (after turning it into a set) should also be exposed - when someone passes in a fancy regexp matcher for the isjunk argument, they can be surprised at what their regexp matches. Being able to see the results can be helpful there too, for debugging.
I meant to include junkdict also. -- Terry Jan Reedy
[Tim]
... BTW, it's not clear whether ratio() computes a _useful_ value in the presence of junk, however that may be defined.
[Terry Reedy]
I agree, which is one reason why one should be able to disable auto-junking.
Yup.
There are a number of statistical methods for analyzing similarity matrices, analogous to correlation matrices, except that entries are in [0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such analyses for sets of species. Similarity measures should usually be symmetric and increase with greater matching. The heuristic can cause both to fail.
Two things about that.

First, ratio() is here _defined_ to be (when comparing sequences a and b):

    2.0 * number_of_matches / (len(a) + len(b))

The denominator is not affected by any junk heuristics, so this ratio does indeed increase directly with "greater matching". But I don't know exactly what you mean by "greater matching" - I get the impression you think that's an inherent property of the sequences, but, as before, I don't see any meaning for it independent of the specific matching algorithm used.

The current implementation of quick_ratio() may be closer to what you seem to be thinking. That views both sequences as multisets, and considers number_of_matches to be the cardinality of their intersection. That measure ignores the possibility of junk, and also ignores the order in which elements appear - it's wholly independent of the matching algorithm. But it returns 1.0 when the second sequence is any permutation of the elements in the first sequence, so throws away any notion of ordering.

Second, it's common as mud for ratio() to give a different result depending on the order of SequenceMatcher's arguments, junk or no junk. This is mostly because it's a good thing for SequenceMatcher to run in finite time ;-) It's not just comparing sequences in the abstract, it's doing so in the context of producing a one-pass "left to right" edit sequence transforming the first sequence into the second, using only "insert" and "delete" operations. If the longest matching block between sequences X and Y is M, X and Y are effectively split into:

    X = A + M + B
    Y = C + M + D

and then the same process is recursively applied to the pairs (A, C) and (B, D). If, for example, there are many matches between blocks A and D, they'll never be seen - a one-pass edit sequence restricted to "insert" and "delete" has to view M as a wall, transforming A into C entirely before it can skip over the matching M and start thinking about how to transform B into D.

For that reason, when there are multiple maximal matching blocks, exactly which one is picked to be M can have percolating effects on how many additional matches are found _later_ (BTW, it's also a reason why a notion of junk can improve the subjective _quality_ of results as judged by humans: if M is composed of things all of which "look interesting" to people, people tend to think the match is intuitive).

The method find_longest_match() uses to pick a winner is documented:

    Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
        alo <= i <= i+k <= ahi
        blo <= j <= j+k <= bhi
    and for all (i',j',k') meeting those conditions,
        k >= k'
        i <= i'
        and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b. This has consequences. Here's a tiny example:

    X = "abbb"
    Y = "bbab"

All maximal matching blocks between X and Y have length 2. "bb" appears twice in X and once in Y. "ab" appears once in each.

Pass X first, and M="ab" is picked as the winner. That breaks the sequences into:

    X = "" + "ab" + "bb"
    Y = "bb" + "ab" + ""

and no further matches are found between "" and "bb", or between "bb" and "". ratio() thus returns 0.5.

But pass Y first, and M="bb" is picked as the winner, breaking the sequences into:

    X = "a" + "bb" + "b"
    Y = "" + "bb" + "ab"

and an _additional_ match (on "b") is later found when comparing the suffixes ("b" and "ab"). ratio() thus returns 0.75 in that case.
I can conceive of trying all possible winners (and so on recursively), and then backtracking to pick a winner at each branch that maximizes the total number of matches - but "it's a good thing for SequenceMatcher to run in finite time" ;-)
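The asymmetry in the "abbb"/"bbab" example is easy to reproduce (expected values per the walkthrough above):

    from difflib import SequenceMatcher

    print(SequenceMatcher(None, "abbb", "bbab").ratio())   # 0.5  (M = "ab")
    print(SequenceMatcher(None, "bbab", "abbb").ratio())   # 0.75 (M = "bb", then "b")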
There are multiple possible definitions of similarity for sets (and arguments thereabout). I am sure the same is true for sequences. But I consider the definition for .ratio, without the heuristic, to be sensible. I would consider using it should the occasion arise.
If you ever used difflib's file-comparison functions, they used ratio() extensively under the covers. The most interesting case when producing diffs for human consumption is when two blocks have _no_ matching lines in common. difflib's file comparators look for the two most similar lines (highest ratio()) in that case, trying to guess where (&, later, how) a human may have edited a line. This produces terrifically useful & intuitive output - when it works ;-)
It certainly was the intent that nothing would be called junk unless it appeared at least twice, so the "n>= 200" clause ensures that 1% of n is at least 2.
Since 2 cannot be greater than something that is at least 2, you ensured that nothing would be called junk unless it appears at least thrice.
OK, staring at the code the minimum is actually 4 times. It's true that if n >= 200 and len(indices) * 100 > n: can't succeed when len(indices) is 2, but when it's 3 we've _previously_ seen 3 instances of the element and are currently looking at the fourth. ...
I'd call it "autojunk", cuz that's what it would do. Also a useful syntactic overlap with the name of the current "isjunk" argument.
I like that. I am now leaning toward the following?
G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 'autojunk = True' to the constructor signature. This is the minimal change that fixes the bug of no choice while keeping the default as is. So it is a minimal violation of the usual stricture against API changes in bugfix releases.
I'm not current on the rules for bugfix releases, but way back when I'm afraid this wouldn't be allowed. The problem is that anyone using the new argument in 2.7.1 would be creating code that would blow up if run under 2.7. Seems that the 6 people who care ;-) about this case would happily agree to live with that, but I have to defer to those more current on bugfix release practice.
I would doc this as "Use an internal heuristic that identifies 'common' items as junk." and separately describe the 'current heuristic', leaving open the possibility of changing it.
Possible suboption: enforce 'autojunk in (True,False)' so the user cannot forget that it is a switch and not a tuning parameter.
Don't think that part is needed.
In 3.2, expose as an attribute a tuple 'heuristic' or '_heuristic' with the tuning parameters for the current heuristic. Adding the _ would indicate that it is a private, expert-only, use at your own risk, subject to change attribute.
If we agree on this much, we can then discuss what the tuple should be for 3.2.
Think the most pressing thing is to give people a way to turn the damn thing off. An ugly way would be to trigger on an unlikely input-output behavior of the existing isjunk argument. For example, if isjunk("what's the airspeed velocity of an unladen swallow?") returned "don't use auto junk!" and 2.7.1 recognized that as meaning "don't use auto junk", code could be written under 2.7.1 that didn't blow up under 2.7. It could _behave_ differently, although that's true of any way of disabling the auto-junk heuristics.
Summary: adding an autojunk heuristic to difflib without also adding a way to turn it off was a bug because it disabled running code.

2.6 and 3.1 each have, most likely, one final version each. Don't fix for these, but add something to the docs explaining the problem and future fix.

2.7 will have several more versions over several years and will be used by newcomers who might encounter the problem but not know how to diagnose it and patch a private copy of the module. So it should have a fix. Solutions thought of so far:

1. Modify the heuristic to somewhat fix the problem. Bad (unacceptable) because this would silently change behavior and could break tests.

2. Add a parameter that defaults to using the heuristic but allows turning it off. Perhaps better, but code that used the new API would crash if run on 2.7.0.

3. Tim Peters:
Think the most pressing thing is to give people a way to turn the damn thing off. An ugly way would be to trigger on an unlikely input-output behavior of the existing isjunk argument. For example, if
isjunk("what's the airspeed velocity of an unladen swallow?")
returned
"don't use auto junk!"
and 2.7.1 recognized that as meaning "don't use auto junk", code could be written under 2.7.1 that didn't blow up under 2.7. It could _behave_ differently, although that's true of any way of disabling the auto-junk heuristics.
Ugly, but perhaps crazy brilliant. Use of such a hack would obviously be temporary. Perhaps its use could be made to issue a -3 warning if such were enabled.

I would simplify the suggestion to something like

    isjunk("disable!heuristic") == True

so one could pass

    lambda s: s == "disable!heuristic"

It should be something easy to document and write. This issue is the only place such a string should appear, so it should be safe.

Tim and Antoine: if you two can agree on what to do for 2.7, Eli and I will code it.

This suggestion amounts to a suggestion that the fix for 2.7 be decoupled from a better fix for 3.2. I agree. The latter can be discussed once 2.7 is settled.

[copied to the tracker]

-- Terry Jan Reedy
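A hypothetical sketch of how the constructor could recognize such a probe without any signature change (string and helper name purely illustrative, not actual difflib code):

    _DISABLE_PROBE = "disable!heuristic"

    def _heuristic_disabled(isjunk):
        # True only if the caller's isjunk callable recognizes the magic
        # probe string; any other isjunk behaves exactly as before.
        try:
            return isjunk is not None and bool(isjunk(_DISABLE_PROBE))
        except Exception:
            return False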
On Wed, 14 Jul 2010 11:45:25 am Terry Reedy wrote:
Summary: adding an autojunk heuristic to difflib without also adding a way to turn it off was a bug because it disabled running code.
2.6 and 3.1 each have, most likely, one final version each. Don't fix for these but add something to the docs explaining the problem and future fix.
2.7 will have several more versions over several years and will be used by newcomers who might encounter the problem but not know to diagnose it and patch a private copy of the module. So it should have a fix. Solutions thought of so far.
1. Modify the heuristic to somewhat fix the problem. Bad (unacceptable) because this would silently change behavior and could break tests.
2. Add a parameter that defaults to using the heuristic but allows turning it off. Perhaps better, but code that used the new API would crash if run on 2.7.0
3. Tim Peters [... snip crazy scheme...]
4. I normally dislike global flags, but this is one time it might be less-worse than the alternatives.

Modify SequenceMatcher to test for the value of a global flag, defaulting to False if it doesn't exist:

    try:
        disable = disable_heuristic
    except NameError:
        disable = False
    if disable:
        ...  # use the new, correct behaviour
    else:
        ...  # stick to the old, backwards-compatible but wrong behaviour

The flag will only exist if the caller explicitly creates it:

    import difflib
    difflib.disable_heuristic = True

On 2.7.0 and older versions, creating the flag won't do anything useful, but nor will it cause an exception. It will be harmless.

I think an explicit flag is better than relying on magic behaviour triggered by "unlikely" input.

-- Steven D'Aprano
On Wed, Jul 14, 2010 at 10:38 PM, Steven D'Aprano <steve@pearwood.info> wrote:
4. I normally dislike global flags, but this is one time it might be less-worse than the alternatives.
Modify SequenceMatcher to test for the value of a global flag, defaulting to False if it doesn't exist.
    try:
        disable = disable_heuristic
    except NameError:
        disable = False
    if disable:
        ...  # use the new, correct behaviour
    else:
        ...  # stick to the old, backwards-compatible but wrong behaviour
The flag will only exist if the caller explicitly creates it:
    import difflib
    difflib.disable_heuristic = True
On 2.7.0 and older versions, creating the flag won't do anything useful, but nor will it cause an exception. It will be harmless.
I think an explicit flag is better than relying on magic behaviour triggered by "unlikely" input.
Why make it a global? A property on the SequenceMatcher object should be able to do the right thing.

    @property
    def use_autojunk(self):
        if self._use_autojunk is None:
            raise AttributeError("...")  # Same error as 2.7.0 would raise
        return self._use_autojunk

    @use_autojunk.setter
    def use_autojunk(self, val):
        self._use_autojunk = bool(val)

Then use the self._use_autojunk flag to decide whether or not to use the heuristic on 2.7.1+.

Code that sets the flag would behave the same on both 2.7.1+ and on 2.7.0, it would just fail to turn the heuristic off in 2.7.0. Code that only read the flag without setting it would also behave the same on the whole 2.7.x series (i.e. raising AttributeError if the flag had not been explicitly set by the application).

A flag on the object rather than a per-call flag may actually be the better API here anyway.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 7/14/2010 9:45 AM, Nick Coghlan wrote:
Code that sets the flag would behave the same on both 2.7.1+ and on 2.7.0, it would just fail to turn the heuristic off in 2.7.0.
Antoine Pitrou pointed out on the tracker http://bugs.python.org/issue2986 that such code would *not* 'behave the same'. It would return different results (thus failing tests, if any). If, as would usually be the case, it *needs* the heuristic turned off to get sensible results, it would fail silently and deceptively. He thinks it better to fail noisily with an exception.

Any code that depends on an x.y.z bugfix for proper operation, and does not include a conditional workaround for prior releases, will fail when run on prior releases. Whether the failure is a bad result (silent) or a bad exception (noisy) usually depends on what the buggy behavior was. This case is an exception in that

a. There is no workaround.
b. A complete fix requires some api change.
c. We have a choice of changes; some make failures on previous x.y releases silent (as they have been), others make them noisy -- without doing anything extra to make them noisy[*].
d. Hence we have a choice of failure mode. Hence the exceptional debate.

[*] For the normal case of bad behavior for some input, we change the code to give the correct behavior. Raising an exception with previous releases would require extra conditional code. Here, by the nature of a heuristic, there is no completely correct fix except to be able to turn it off. (If there were a correct deterministic rule, it would not be a heuristic ;-). The mechanism for turning it off might or might not raise an exception 'for free', with no added code for that.

-- Terry Jan Reedy
[Steven D'Aprano]
4. I normally dislike global flags, but this is one time it might be less-worse than the alternatives.
Modify SequenceMatcher to test for the value of a global flag, defaulting to False if it doesn't exist. ... The flag will only exist if the caller explicitly creates it:
import difflib
difflib.disable_heuristic = True
...
A module global is a non-starter for me (for all the usual reasons about the potentially horrid sociopathic consequences of one piece of an app changing behavior for the entire app - fine if it's a small app wholly understood by the user, potential nightmares if it's a large app nobody understands <0.3 wink>).

[Nick Coghlan]
Why make it a global? A property on the SequenceMatcher object should be able to do the right thing. ...
The pragmatic drawback is that the heuristics we're trying to disable typically execute _during_ SequenceMatcher instance construction. That's why it makes most sense to convey the information to the constructor. Setting a property after the instance is created is typically too late (unless several other SequenceMatcher methods are rewritten to delay analyzing the second sequence).
A flag on the object rather than a per-call flag may actually be the better API here anyway.
The call in question here is the constructor (__init__), so there's no real difference between "on the object" and "per call" in this case.
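Since the analysis of b (and with it the heuristic) happens inside __init__, the natural place for an opt-out is a constructor argument. The following is a minimal runnable sketch of that shape using a simplified stand-in class; the parameter name autojunk echoes Nick's use_autojunk and is an assumption here, not the committed difflib API.

# Sketch only: MiniMatcher is a toy stand-in, not the real SequenceMatcher.
class MiniMatcher(object):
    def __init__(self, isjunk=None, a='', b='', autojunk=True):
        self.isjunk = isjunk
        self.autojunk = autojunk            # must be known before b is analyzed
        self.a = a
        self.b = b
        self.b2j = self._build_b2j(b)       # analysis of b happens during construction

    def _build_b2j(self, b):
        b2j = {}
        for i, elt in enumerate(b):
            b2j.setdefault(elt, []).append(i)
        if self.isjunk:
            for elt in [e for e in b2j if self.isjunk(e)]:
                del b2j[elt]
        if self.autojunk and len(b) >= 200:
            # drop elements occurring in more than roughly one percent of b
            thresh = len(b) // 100 + 1
            for elt in [e for e, idxs in b2j.items() if len(idxs) > thresh]:
                del b2j[elt]
        return b2j

m = MiniMatcher(None, 'x' + 'y'*199, 'y'*200, autojunk=False)
print('y' in m.b2j)   # True: with the heuristic off, 'y' is kept as content
m = MiniMatcher(None, 'x' + 'y'*199, 'y'*200)
print('y' in m.b2j)   # False: with the heuristic on, 'y' is junked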
On Thu, Jul 15, 2010 at 6:40 AM, Tim Peters <tim.peters@gmail.com> wrote:
The call in question here is the constructor (__init__), so there's no real difference between "on the object" and "per call" in this case.
You're right, I was misremembering how SequenceMatcher works.

Terry's summary of the situation seems correct to me - adding a new flag to the constructor signature would mean we're taking a silent failure ("the heuristic makes my code give the wrong answer on 2.7.0") and making it a noisy failure ("my code needs to be able to turn the heuristic off to get the right answer, so it will fail noisily on 2.7.0"). That's a far cry from the True/False mistake.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
[Nick Coghlan]
You're right, I was misremembering how SequenceMatcher works.
Terry's summary of the situation seems correct to me - adding a new flag to the constructor signature would mean we're taking a silent failure ("the heuristic makes my code give the wrong answer on 2.7.0") and making it a noisy failure ("my code needs to be able to turn the heuristic off to get the right answer, so it will fail noisily on 2.7.0").
Yup - exactly so.
That's a far cry from the True/False mistake.
Which I'm sure refers to introducing True and False builtins in a bugfix release. That was _almost_ as bad as the sometimes-hated heuristic I added to SequenceMatcher ;-) Me, I like "fail noisily on 2.7.0" for this case - practicality beats purity, especially when so few people and programs are likely to be affected by the backwards-incompatible-but-only-for-those-who-need-to-use-it SequenceMatcher constructor signature change.
On 7/14/2010 7:32 PM, Tim Peters wrote:
[Nick Coghlan]
You're right, I was misremembering how SequenceMatcher works.
Terry's summary of the situation seems correct to me - adding a new flag to the constructor signature would mean we're taking a silent failure ("the heuristic makes my code give the wrong answer on 2.7.0") and making it a noisy failure ("my code needs to be able to turn the heuristic off to get the right answer, so it will fail noisily on 2.7.0").
Yup - exactly so.
That's a far cry from the True/False mistake.
Which I'm sure refers to introducing True and False builtins in a bugfix release. That was _almost_ as bad as the sometimes-hated heuristic I added to SequenceMatcher ;-)
Me, I like "fail noisily on 2.7.0" for this case - practicality beats purity, especially when so few people and programs are likely to be affected by the backwards-incompatible-but-only-for-those-who-need-to-use-it SequenceMatcher constructor signature change.
Then tomorrow I will work with Eli on a suggested 2.6/3.1 doc patch and 2.7 code+doc patch. -- Terry Jan Reedy
Rather than reverting to Tim's undocumented vision, perhaps we should better articulate it by separating the general-purpose matcher from an optimised text matcher.
For what it's worth, my benchmarking showed that modifying the heuristic to only kick in when there are more than 100 kinds of elements (Terry's option A) didn't affect the runtime of matching whatsoever, even when the heuristic *does* kick in. All it adds, really, is the overhead of a single 'if' statement.

So it wouldn't be right to assume that modifying the heuristic, or allowing it to be turned off, will negatively affect performance in the special case Tim originally optimized for.

Eli
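Concretely, option A amounts to one extra gate in front of the popularity pass. A minimal sketch of that gate follows; the function and variable names are assumptions for illustration and are not taken from Eli's patch.

# Sketch only: names are invented; b2j stands for the already-built index of b.
def should_autojunk(b, b2j):
    # Apply the popularity heuristic only to long sequences drawn from a
    # large "alphabet" of distinct elements.
    return len(b) >= 200 and len(b2j) > 100

dna = 'acgt' * 500
print(should_autojunk(dna, dict.fromkeys(dna)))      # False: only 4 distinct symbols

code = ['line %d' % (i % 150) for i in range(300)]
print(should_autojunk(code, dict.fromkeys(code)))    # True: 150 distinct lines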
On Wed, 7 Jul 2010 19:44:31 +0200 Eli Bendersky <eliben@gmail.com> wrote:
For what it's worth, my benchmarking showed that modifying the heuristic to only kick in when there are more than 100 kinds of elements (Terry's option A) didn't affect the runtime of matching whatsoever, even when the heuristic *does* kick in. All it adds, really, is the overhead of a single 'if' statement. So it wouldn't be right to assume that somehow modifying the heuristic or allowing to turn it off will negatively affect performance in the special case Tim originally optimized for.
Just because it doesn't affect performance in your tests doesn't mean it won't do so in the general case. Consider a case where Tim's junk optimization kicked in and helped improve performance a lot, but where there are still less than 100 alphabet symbols. The new heuristic will ruin this use case. That's why I'm advocating a dedicated flag instead.
Antoine Pitrou wrote:
On Wed, 7 Jul 2010 19:44:31 +0200 Eli Bendersky <eliben@gmail.com> wrote:
For what it's worth, my benchmarking showed that modifying the heuristic to only kick in when there are more than 100 kinds of elements (Terry's option A) didn't affect the runtime of matching whatsoever, even when the heuristic *does* kick in. ...
Just because it doesn't affect performance in your tests doesn't mean it won't do so in the general case. Consider a case where Tim's junk optimization kicked in and helped improve performance a lot, but where there are still less than 100 alphabet symbols. The new heuristic will ruin this use case.
That would describe pretty much every C program ever written, for instance, and nearly as high a percentage of all Python modules / scripts ever written: the 'string.printable' set, less formfeed and vertical tab, is 98 characters long.
That's why I'm advocating a dedicated flag instead.
+1.

Tres.

--
Tres Seaver    +1 540-429-0999    tseaver@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
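Tres's 98-character figure is easy to verify with CPython's string module (a quick check for illustration):

import string
print(len(string.printable))                           # 100 printable ASCII characters
print(len(set(string.printable) - set('\x0b\x0c')))    # 98 once vertical tab and formfeed go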
On 7/7/2010 4:11 PM, Tres Seaver wrote:
Antoine Pitrou wrote:
On Wed, 7 Jul 2010 19:44:31 +0200 Eli Bendersky<eliben@gmail.com> wrote:
For what it's worth, my benchmarking showed that modifying the heuristic to only kick in when there are more than 100 kinds of elements (Terry's option A) didn't affect the runtime of matching whatsoever, even when the heuristic *does* kick in. ...
Just because it doesn't affect performance in your tests doesn't mean it won't do so in the general case. Consider a case where Tim's junk optimization kicked in and helped improve performance a lot, but where there are still less than 100 alphabet symbols. The new heuristic will ruin this use case.
That would describe pretty much every C program ever written, for instance, and nearly as high a percentage of all Python modules / scripts ever written: the 'string.printable' set, less formfeed and vertical tab, is 98 characters long.
In the primary use case, programs are compared line by line, not character by character, and there are more than 100 different lines in any sensible program of at least 200 lines. -- Terry Jan Reedy
participants (8)
- Antoine Pitrou
- Eli Bendersky
- Kevin Jacobs <jacobs@bioinformed.com>
- Nick Coghlan
- Steven D'Aprano
- Terry Reedy
- Tim Peters
- Tres Seaver