Comparing two book chapters (text files)

M.-A. Lemburg mal at
Thu Feb 5 07:00:30 EST 2009

On 2009-02-05 02:20, Nick Matzke wrote:
> Hi all,
> So I have an interesting challenge.  I want to compare two book
> chapters, which I have in plain text format, and find out (a) percentage
> similarity and (b) what has changed.
> Some features make this problem different than what seems to be the
> standard text-matching problem solvable with e.g. difflib.  Here is what
> I mean:
> * there is no guarantee that single lines from each file will be
> directly comparable -- e.g., if a few words are inserted into a
> sentence, then a chunk of the sentence will be moved to the next line,
> then a chunk of that line moved to the next, etc.
> * Also, there are cases where paragraphs have been moved around,
> sections re-ordered, etc.  So it can't just be a "linear" match.
> I imagine this kind of thing can't be all that hard in the grand scheme
> of things, but I couldn't find an easily applicable solution readily
> available.  I have advanced beginner python skills but am not quite
> where I could do this kind of thing from scratch without some guidance
> about the likely functions, libraries etc. to use.
> PS: I am going to have to do this for multiple book chapters so various
> software packages, e.g. for windows, are not really usable.
> Any help is much appreciated!!

difflib is in the Python stdlib and provides many ways to implement
difference detection:
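As a rough illustration of part (a), here is a minimal sketch that computes a similarity ratio over words rather than lines, so re-wrapped text does not count as a change. The function name word_similarity is made up for this example, and it assumes that splitting on whitespace is good enough for the chapters in question. Note that this does not handle moved paragraphs specially; text that has been relocated will lower the ratio.

```python
import difflib

def word_similarity(text_a, text_b):
    """Return a similarity ratio in [0, 1] computed over words, not lines."""
    # Splitting on any whitespace makes line breaks irrelevant.
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which can otherwise ignore very frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()

if __name__ == "__main__":
    old = "the quick brown\nfox jumps"
    new = "the quick\nbrown fox jumps"
    print("similarity: %.1f%%" % (100 * word_similarity(old, new)))
```

Multiply the result by 100 to get a percentage. For two whole files, read each with open(...).read() and pass the strings in.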

Here's a script that I use for diff'ing text files on a word
basis, called

It helps a lot with text that gets word wrapped or reformatted.
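This is not the script mentioned above, just a minimal sketch of the same word-based idea using only difflib, to show what part (b) could look like. The name word_changes and the tuple layout it returns are assumptions made for this example.

```python
import difflib

def word_changes(old_text, new_text):
    """List the word-level differences between two texts.

    Returns (tag, old_words, new_words) tuples, where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged runs of words are skipped
        changes.append((tag,
                        " ".join(old_words[i1:i2]),
                        " ".join(new_words[j1:j2])))
    return changes

if __name__ == "__main__":
    for change in word_changes("the quick brown fox", "the quick\nred fox"):
        print(change)
```

Because the comparison runs on the word sequence, re-wrapping a paragraph produces no changes at all. Reordered paragraphs will still show up as a delete in one place and an insert in another.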

Marc-Andre Lemburg

