[Tutor] Word-by-word diff in Python

Fri, 19 Apr 2002 11:27:41 +0200

Hi Danny,

At 20:29 18/04/2002 -0700, Danny Yoo wrote:
>Interesting!  Hmmm... if the indentation or formatting is significant, we
>could transform a line-by-line diff utility into a word-by-word by turning
>the newlines into some sort of sentinel "NEWLINE"  character.

Yes, thanks; that's the key idea. I had toyed with difflib but I fed it two 
strings instead of word lists; hence it spat back a character-based diff.
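
To make the difference concrete, here is a minimal sketch. The sample strings and the helper name changed_spans are made up for illustration; only SequenceMatcher and get_opcodes() come from difflib.

```python
from difflib import SequenceMatcher

def changed_spans(a, b):
    """Return the non-equal opcodes as (tag, piece_of_a, piece_of_b)."""
    matcher = SequenceMatcher(None, a, b)
    return [(tag, a[alo:ahi], b[blo:bhi])
            for tag, alo, ahi, blo, bhi in matcher.get_opcodes()
            if tag != 'equal']

old = "a grey cat"
new = "a gray cat"

# Two strings: the sequence items are single characters.
char_level = changed_spans(old, new)

# Two word lists: the sequence items are whole words.
word_level = changed_spans(old.split(), new.split())

print(char_level)
print(word_level)
```

With strings, the only reported change is the single character 'e' becoming 'a'; with word lists, the whole word 'grey' is replaced by 'gray'.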

>We could then apply a string.split() to break the lines into individual
>words.  Python comes with a standard library module called "difflib":
>
>     http://www.python.org/doc/current/lib/module-difflib.html
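
A minimal sketch of that sentinel trick, assuming a "<NL>" marker that never occurs in the text itself. The names NEWLINE, to_words and from_words are hypothetical, and indentation inside lines is not preserved.

```python
NEWLINE = "<NL>"  # hypothetical sentinel; must not appear in the real text

def to_words(text):
    """Split text into words, keeping each newline as its own token."""
    # Pad the sentinel with spaces so split() never glues it to a word.
    return text.replace("\n", " %s " % NEWLINE).split()

def from_words(words):
    """Rebuild the text, turning sentinel tokens back into newlines."""
    lines, current = [], []
    for word in words:
        if word == NEWLINE:
            lines.append(" ".join(current))
            current = []
        else:
            current.append(word)
    lines.append(" ".join(current))
    return "\n".join(lines)

sample = "one two\n  three\n\nfour"
tokens = to_words(sample)
print(tokens)
print(from_words(tokens))
```

The resulting word lists can be passed to any sequence-based diff, and the line structure is restored afterwards.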

One problem I had with this module is that it only ships with Python 2.1
and 2.2, whereas I try to keep my app compatible with Python 2.0.

I found a partial backported version in ViewCVS; it features the 
SequenceMatcher class, but the ndiff function is not included.

> >>> difflib.ndiff(revision_1, revision_2)
><generator object at 0x81641b8>
> >>> diff = difflib.ndiff(revision_1, revision_2)

>To grab the whole diff at once, let's convince Python to give it to us as
>a list:

> >>> results = list(difflib.ndiff(revision_1, revision_2))

>And the output here can be modified to look like a nice HTML formatted
>text with strikeouts and everything.  *grin*
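
As a rough sketch of that idea: difflib.ndiff() prefixes each item with a two-character code ("- ", "+ ", "  " or "? "), so the codes can be mapped to markup. The function name words_to_html and the choice of <del>/<ins> tags are assumptions for illustration, and the words are not HTML-escaped here.

```python
import difflib

def words_to_html(old_words, new_words):
    """Render a word-level ndiff as text with <del>/<ins> markup."""
    pieces = []
    for entry in difflib.ndiff(old_words, new_words):
        code, word = entry[:2], entry[2:]
        if code == "- ":
            pieces.append("<del>%s</del>" % word)   # only in old_words
        elif code == "+ ":
            pieces.append("<ins>%s</ins>" % word)   # only in new_words
        elif code == "  ":
            pieces.append(word)                     # unchanged
        # "? " entries are intraline hints for humans; skip them.
    return " ".join(pieces)

print(words_to_html("a new hope".split(), "a hope".split()))
print(words_to_html(["spam"], ["spam", "eggs"]))
```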

Yes, that's good. I found an example that used the SequenceMatcher class 
directly, though it's lower-level. Here is a test implementation:

##
from difflib import SequenceMatcher

class TextDiff:
     """Create diffs of text snippets."""

     def __init__(self, source, target):
         """source = source text - target = target text"""
         self.nl = "<NL>"
         self.delTag = "<span class='deleted'>%s</span>"
         self.insTag = "<span class='inserted'>%s</span>"
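         # Pad the sentinel with spaces so split() always yields it as a
         # separate token, even when the next line has no indentation.
         self.source = source.replace("\n", " %s " % self.nl).split()
         self.target = target.replace("\n", " %s " % self.nl).split()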
         self.deleteCount, self.insertCount, self.replaceCount = 0, 0, 0
         self.diffText = None
         self.cruncher = SequenceMatcher(None, self.source,
                                         self.target)
         self._buildDiff()

     def _buildDiff(self):
         """Create a tagged diff."""
         outputList = []
         for tag, alo, ahi, blo, bhi in self.cruncher.get_opcodes():
             if tag == 'replace':
                 # Text replaced = deletion + insertion
                 outputList.append(self.delTag % " ".join(self.source[alo:ahi]))
                 outputList.append(self.insTag % " ".join(self.target[blo:bhi]))
                 self.replaceCount += 1
             elif tag == 'delete':
                 # Text deleted
                 outputList.append(self.delTag % " ".join(self.source[alo:ahi]))
                 self.deleteCount += 1
             elif tag == 'insert':
                 # Text inserted
                 outputList.append(self.insTag % " ".join(self.target[blo:bhi]))
                 self.insertCount += 1
             elif tag == 'equal':
                 # No change
                 outputList.append(" ".join(self.source[alo:ahi]))
         diffText = " ".join(outputList)
         diffText = " ".join(diffText.split())
         self.diffText = diffText.replace(self.nl, "\n")

     def getStats(self):
         "Return a tuple of stat values."
         return (self.insertCount, self.deleteCount, self.replaceCount)

     def getDiff(self):
         "Return the diff text."
         return self.diffText

if __name__ == "__main__":
     ch1 = """Today, a generation raised in the shadows of the Cold
     War assumes new responsibilities in a world warmed by the sunshine of
     freedom"""

     ch2 = """Today, pythonistas raised in the shadows of the Cold
     War assumes responsibilities in a world warmed by the sunshine of
     spam and freedom"""

     differ = TextDiff(ch1, ch2)

     print("%i insertion(s), %i deletion(s), %i replacement(s)"
           % differ.getStats())
     print(differ.getDiff())

1 insertion(s), 1 deletion(s), 1 replacement(s)
Today, <span class='deleted'>a generation</span> <span class='inserted'>pythonistas</span> raised in the shadows of the Cold
 War assumes <span class='deleted'>new</span> responsibilities in a world warmed by the sunshine of
 <span class='inserted'>spam and</span> freedom
##

Cheers.

Alexandre