[Tutor] Word-by-word diff in Python
Alexandre Ratti
Fri, 19 Apr 2002 11:27:41 +0200
Hi Danny,
At 20:29 18/04/2002 -0700, Danny Yoo wrote:
>Interesting! Hmmm... if the indentation or formatting is significant, we
>could transform a line-by-line diff utility into a word-by-word by turning
>the newlines into some sort of sentinel "NEWLINE" character.
Yes, thanks; that's the key idea. I had toyed with difflib but I fed it two
strings instead of word lists; hence it spat back a character-based diff.
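To illustrate the difference, here is a small sketch (written for a current Python 3; the sample sentences are made up for the example). difflib compares whatever sequence elements it is handed, so a string is compared character by character, while a list of words is compared word by word:

```python
import difflib

old = "the quick fox"
new = "the slow fox"

# Raw strings: each compared element is a single character.
char_diff = list(difflib.ndiff(old, new))

# Word lists: each compared element is a whole word.
word_diff = list(difflib.ndiff(old.split(), new.split()))

print(char_diff[:4])
print(word_diff)
```

Each entry starts with a two-character code ("- ", "+ " or two spaces), followed by the element itself.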
>We could then apply a string.split() to break the lines into individual
>words. Python comes with a standard library module called "difflib":
> http://www.python.org/doc/current/lib/module-difflib.html
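A minimal sketch of that splitting step, assuming a hypothetical sentinel token "<NL>" and helper names (to_words, from_words) invented for the example. It pads the sentinel with spaces so that each line break becomes a word of its own:

```python
NEWLINE = "<NL>"  # hypothetical sentinel; any token absent from the text works

def to_words(text):
    """Split text into words, keeping each line break as its own token."""
    return text.replace("\n", " %s " % NEWLINE).split()

def from_words(words):
    """Rebuild text from tokens, turning sentinels back into newlines."""
    lines = " ".join(words).split(NEWLINE)
    return "\n".join(line.strip() for line in lines)

words = to_words("first line\nsecond line")
print(words)
print(from_words(words))
```

Note that this normalizes runs of spaces, so it only suits text where the exact spacing inside a line does not matter.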
One problem I had with this module is that it was added to the standard
library in 2.1 and 2.2, whereas I try to make my app compatible with Python
I found a partial backported version in ViewCVS; it features the
SequenceMatcher class, but ndiff is not included.
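As a rough illustration of what SequenceMatcher offers on its own (Python 3 syntax, with made-up word lists), get_opcodes() describes how to turn the first sequence into the second as a series of tagged index ranges:

```python
from difflib import SequenceMatcher

a = "a b c d".split()
b = "a x c d e".split()

matcher = SequenceMatcher(None, a, b)
for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
    # tag is one of 'equal', 'replace', 'delete' or 'insert';
    # a[alo:ahi] is the affected slice of a, b[blo:bhi] that of b.
    print(tag, a[alo:ahi], b[blo:bhi])

tags = [op[0] for op in matcher.get_opcodes()]
```

That is enough to build a word diff without ndiff, at the cost of handling each tag by hand.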
> >>> difflib.ndiff(revision_1, revision_2)
><generator object at 0x81641b8>
> >>> diff = difflib.ndiff(revision_1, revision_2)
>To grab the whole diff at once, let's convince Python to give it to us as
>a list:
> >>> results = list(difflib.ndiff(revision_1, revision_2))
>And the output here can be modified to look like a nice HTML formatted
>text with strikeouts and everything. *grin*
Yes, that's good. I found an example that used the SequenceMatcher class
directly, though it's lower-level. Here is a test implementation:
from difflib import SequenceMatcher

class TextDiff:
    """Create diffs of text snippets."""

    def __init__(self, source, target):
        """source = source text - target = target text"""
        self.nl = "<NL>"
        self.delTag = "<span class='deleted'>%s</span>"
        self.insTag = "<span class='inserted'>%s</span>"
        self.source = source.replace("\n", "\n%s" % self.nl).split()
        self.target = target.replace("\n", "\n%s" % self.nl).split()
        self.deleteCount, self.insertCount, self.replaceCount = 0, 0, 0
        self.diffText = None
        self.cruncher = SequenceMatcher(None, self.source,
                                        self.target)
        # Build the diff up front so getDiff() and getStats() have data
        self._buildDiff()

    def _buildDiff(self):
        """Create a tagged diff."""
        outputList = []
        for tag, alo, ahi, blo, bhi in self.cruncher.get_opcodes():
            if tag == 'replace':
                # Text replaced = deletion + insertion
                outputList.append(self.delTag % " ".join(self.source[alo:ahi]))
                outputList.append(self.insTag % " ".join(self.target[blo:bhi]))
                self.replaceCount += 1
            elif tag == 'delete':
                # Text deleted
                outputList.append(self.delTag % " ".join(self.source[alo:ahi]))
                self.deleteCount += 1
            elif tag == 'insert':
                # Text inserted
                outputList.append(self.insTag % " ".join(self.target[blo:bhi]))
                self.insertCount += 1
            elif tag == 'equal':
                # No change
                outputList.append(" ".join(self.source[alo:ahi]))
        diffText = " ".join(outputList)
        # Collapse any repeated whitespace before restoring the newlines
        diffText = " ".join(diffText.split())
        self.diffText = diffText.replace(self.nl, "\n")

    def getStats(self):
        "Return a tuple of stat values."
        return (self.insertCount, self.deleteCount, self.replaceCount)

    def getDiff(self):
        "Return the diff text."
        return self.diffText

if __name__ == "__main__":
    ch1 = """Today, a generation raised in the shadows of the Cold
War assumes new responsibilities in a world warmed by the sunshine of freedom"""
    ch2 = """Today, pythonistas raised in the shadows of the Cold
War assumes responsibilities in a world warmed by the sunshine of
spam and freedom"""
    differ = TextDiff(ch1, ch2)
    print("%i insertion(s), %i deletion(s), %i replacement(s)" %
          differ.getStats())
    print(differ.getDiff())
1 insertion(s), 1 deletion(s), 1 replacement(s)
Today, a <span class='deleted'>generation</span> <span
class='inserted'>pythonista</span> raised in the shadows of the Cold
War assumes <span class='deleted'>new</span> responsibilities in a world
warmed by the sunshine of
<span class='inserted'>spam and</span> freedom
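For comparison, the list(difflib.ndiff(...)) route quoted earlier could be rendered along similar lines. This is only a sketch in Python 3 syntax: the function name ndiff_to_html and the choice of <del>/<ins> tags are assumptions for the example, not part of difflib.

```python
import difflib

def ndiff_to_html(old_words, new_words):
    """Render an ndiff of two word lists as one line of HTML-ish text."""
    pieces = []
    for entry in difflib.ndiff(old_words, new_words):
        code, word = entry[:2], entry[2:]
        if code == "- ":
            pieces.append("<del>%s</del>" % word)    # only in the old text
        elif code == "+ ":
            pieces.append("<ins>%s</ins>" % word)    # only in the new text
        elif code == "  ":
            pieces.append(word)                      # unchanged
        # "? " entries are hint lines for near-matches; they are skipped here.
    return " ".join(pieces)

html = ndiff_to_html("the quick fox".split(), "the slow fox".split())
print(html)
```

Unlike the opcode-based class above, this marks each word separately rather than grouping a run of changed words into one span, and it does not escape HTML special characters in the words.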