[Reposted to python-dev!]
We've done some customizations to difflib to make it work well with the pagetests we are running on a project at Canonical, and we are looking for some guidance on the best way to do them. There are some tricky bits that have to do with how the class inheritance is put together, and since we want to avoid duplicating difflib I figured we'd ask and see if some grand ideas come up.
A [rough first cut of the] patch is inlined below. Essentially, it does:
- Implements a custom Differ.fancy_compare function that supports ellipsis and omits equal content
- Hacks _fancy_replace to skip ellipsis as well.
- Hacks best_ratio and cutoff. I'm a bit fuzzy on why this was changed, to be honest, and Celso's travelling today, but IIRC it had to do with how difflib grouped changes.
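Going by the comment in the patch, the likely reason for the lower cutoff is that a line containing an ellipsis and the real line it stands for score well below the stock 0.75 threshold, so they would never be synched up. A rough way to see this, as a sketch in current Python that uses only the public SequenceMatcher.ratio() and the two strings from the patch comment (the constant names are made up for illustration, and the real _fancy_replace check is a bit more involved):

```python
import difflib

STOCK_CUTOFF = 0.75    # cutoff in the unpatched _fancy_replace
PATCHED_CUTOFF = 0.01  # cutoff after the patch

# The two example strings from the comment in the patch.
want = '<something> ... <something>'
got = '<a> blabla </a>'

# ratio() is 2*M/T: M matched characters, T total characters in both strings.
ratio = difflib.SequenceMatcher(None, want, got).ratio()

print('similarity ratio:', round(ratio, 3))
print('clears stock cutoff:  ', ratio >= STOCK_CUTOFF)
print('clears patched cutoff:', ratio >= PATCHED_CUTOFF)
```

With a cutoff that low nearly any pair of lines counts as "similar", which is presumably what changes how the replaced lines get grouped.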
Essentially, what we aim for is:
- Ignoring ellipsisized(!) content
- Omitting content which is equal
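To make those two goals concrete, here is a minimal standalone sketch in current Python. It is not the patch itself: ellipsis_match and compact_diff are names invented for this illustration, and it only handles the simple case where a replaced block has the same number of lines on both sides.

```python
import difflib
import re


def ellipsis_match(want, got):
    """Return True if `got` matches `want`, where '...' matches any text."""
    parts = [re.escape(piece) for piece in want.split('...')]
    return re.fullmatch('.*'.join(parts), got, re.DOTALL) is not None


def compact_diff(want, got):
    """Yield only the differing lines, treating '...' in `want` as a wildcard."""
    matcher = difflib.SequenceMatcher(None, want, got)
    for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
        if tag == 'equal':
            continue  # omit content which is equal
        a_block, b_block = want[alo:ahi], got[blo:bhi]
        if tag == 'replace' and len(a_block) == len(b_block):
            # Pair lines up one to one and drop pairs the ellipsis covers.
            for w, g in zip(a_block, b_block):
                if not ellipsis_match(w, g):
                    yield '- ' + w
                    yield '+ ' + g
            continue
        for line in a_block:
            yield '- ' + line
        for line in b_block:
            yield '+ ' + line


want = ['World ... Cruel', 'Dudes ... Cool']
got = ['World is Cruel', 'Dudes are Cool']
print(list(compact_diff(want, got)))
```

Every expected line here is covered by its ellipsis, so nothing is reported; a pair that really differs would still show up as a '-'/'+' pair.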
I initially thought the best way to do this would be to inherit from SequenceMatcher and make it not return opcodes for ellipsis. However, there is no easy way to replace the class short of rewriting major bits of Differ. I suspect this could be easily changed to use a class attribute that we could override, but let me know what you think of the whole thing.
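Roughly what I have in mind with the class attribute, as a sketch in current Python. Note that matcher_class does not exist in difflib today; it, LineDiffer, EllipsisMatcher and EllipsisDiffer are all hypothetical names, and LineDiffer is a simplified stand-in for Differ rather than the real class.

```python
import difflib


class EllipsisMatcher(difflib.SequenceMatcher):
    """Hypothetical matcher that hides 'replace' opcodes whose expected
    side consists only of '...' lines."""

    def get_opcodes(self):
        kept = []
        for tag, alo, ahi, blo, bhi in super().get_opcodes():
            expected = self.a[alo:ahi]
            if tag == 'replace' and all(s.strip() == '...' for s in expected):
                continue  # the ellipsis swallows whatever is on the other side
            kept.append((tag, alo, ahi, blo, bhi))
        return kept


class LineDiffer:
    """Simplified stand-in for Differ with an overridable matcher class."""

    matcher_class = difflib.SequenceMatcher  # hypothetical override point

    def compare(self, a, b):
        matcher = self.matcher_class(None, a, b)
        for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
            if tag in ('replace', 'delete'):
                for line in a[alo:ahi]:
                    yield '- ' + line
            if tag in ('replace', 'insert'):
                for line in b[blo:bhi]:
                    yield '+ ' + line
            # 'equal' blocks produce no output


class EllipsisDiffer(LineDiffer):
    matcher_class = EllipsisMatcher  # the only line a subclass has to change


want = ['header', '...', 'footer']
got = ['header', 'anything at all', 'footer']
print(list(LineDiffer().compare(want, got)))
print(list(EllipsisDiffer().compare(want, got)))
```

The point is only that the subclass swaps the matcher without copying compare(); whether Differ should grow such a hook is exactly the question.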
--- /usr/lib/python2.3/difflib.py 2004-11-18 20:05:38.720109040 -0200
+++ difflib.py 2004-11-18 20:24:06.731665680 -0200
@@ -885,6 +885,45 @@
             for line in g:
                 yield line
 
+    def fancy_compare(self, a, b):
+        """
+        >>> import difflib
+        >>> engine = difflib.Differ()
+        >>> got = ['World is Cruel', 'Dudes are Cool']
+        >>> want = ['World ... Cruel', 'Dudes ... Cool']
+        >>> list(engine.fancy_compare(want, got))
+
+
+        """
+        cruncher = SequenceMatcher(self.linejunk, a, b)
+        for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
+
+            if tag == 'replace':
+                ## replace single line
+                if a[alo:ahi].rstrip() == '...' and ((ahi - alo) == 1):
+                    g = None
+                ## two lines replaced
+                elif a[alo:ahi].rstrip() == '...' and ((ahi - alo) > 1):
+                    g = self._fancy_replace(a, (ahi - 1), ahi,
+                                            b, (bhi - 1), bhi)
+                ## common
+                else:
+                    g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
+            elif tag == 'delete':
+                g = self._dump('-', a, alo, ahi)
+            elif tag == 'insert':
+                g = self._dump('+', b, blo, bhi)
+            elif tag == 'equal':
+                # do not show anything
+                g = None
+            else:
+                raise ValueError, 'unknown tag ' + `tag`
+
+            if g:
+                for line in g:
+                    yield line
+
+
     def _dump(self, tag, x, lo, hi):
         """Generate comparison results for a same-tagged range."""
         for i in xrange(lo, hi):
@@ -926,7 +965,13 @@
 
         # don't synch up unless the lines have a similarity score of at
         # least cutoff; best_ratio tracks the best score seen so far
-        best_ratio, cutoff = 0.74, 0.75
+        #best_ratio, cutoff = 0.74, 0.75
+
+        ## reduce the cutoff to have enough similarity
+        ## between '<something> ... <something>' and '<a> blabla </a>'
+        ## for example
+        best_ratio, cutoff = 0.009, 0.01
+
         cruncher = SequenceMatcher(self.charjunk)
         eqi, eqj = None, None   # 1st indices of equal lines (if any)
 
@@ -981,7 +1026,11 @@
             cruncher.set_seqs(aelt, belt)
             for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
                 la, lb = ai2 - ai1, bj2 - bj1
-                if tag == 'replace':
+
+                if aelt[ai1:ai2] == '...':
+                    return
+
+                if tag == 'replace':
                     atags += '^' * la
                     btags += '^' * lb
                 elif tag == 'delete':
Take care,
--
Christian Robottom Reis | http://async.com.br/%7Ekiko/ | [+55 16] 3361 2331