Comparing two book chapters (text files)
mal at egenix.com
Thu Feb 5 07:00:30 EST 2009
On 2009-02-05 02:20, Nick Matzke wrote:
> Hi all,
> So I have an interesting challenge. I want to compare two book
> chapters, which I have in plain text format, and find out (a) percentage
> similarity and (b) what has changed.
> Some features make this problem different than what seems to be the
> standard text-matching problem solvable with e.g. difflib. Here is what
> I mean:
> * there is no guarantee that single lines from each file will be
> directly comparable -- e.g., if a few words are inserted into a
> sentence, then a chunk of the sentence will be moved to the next line,
> then a chunk of that line moved to the next, etc.
> * Also, there are cases where paragraphs have been moved around,
> sections re-ordered, etc. So it can't just be a "linear" match.
> I imagine this kind of thing can't be all that hard in the grand scheme
> of things, but I couldn't find an easily applicable solution readily
> available. I have advanced beginner python skills but am not quite
> where I could do this kind of thing from scratch without some guidance
> about the likely functions, libraries etc. to use.
> PS: I am going to have to do this for multiple book chapters so various
> software packages, e.g. for windows, are not really usable.
> Any help is much appreciated!!
difflib is in the Python stdlib and provides many ways to implement
this kind of comparison.
Here's a script that I use for diff'ing text files on a word
basis, called tdiff.py:
It helps a lot with text that gets word wrapped or reformatted.
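To give a rough idea of the word-based approach, here is a minimal sketch using difflib.SequenceMatcher. It is not the tdiff.py script itself; the function name word_diff and the shape of its return value are made up for this illustration. The idea is to split both texts on whitespace so that line wrapping no longer matters, then let difflib compare the two word lists.

```python
import difflib


def word_diff(old_text, new_text):
    """Compare two texts word by word, ignoring how lines are wrapped.

    Returns (ratio, changes): ratio is difflib's similarity score between
    0.0 and 1.0, and changes is a list of (tag, old_words, new_words)
    tuples for every span that is not identical in both texts.
    """
    # str.split() with no arguments splits on any whitespace, including
    # newlines, so re-wrapped paragraphs yield the same word list.
    old_words = old_text.split()
    new_words = new_text.split()

    # autojunk=False switches off the "popular element" heuristic, which
    # could otherwise treat very common words in long texts as junk.
    matcher = difflib.SequenceMatcher(None, old_words, new_words,
                                      autojunk=False)

    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # tag is one of 'replace', 'delete' or 'insert'
        changes.append((tag,
                        " ".join(old_words[i1:i2]),
                        " ".join(new_words[j1:j2])))
    return matcher.ratio(), changes


if __name__ == "__main__":
    old = "The quick brown fox\njumps over the lazy dog"
    new = "The quick red fox jumps\nover the lazy dog"
    ratio, changes = word_diff(old, new)
    print("similarity: %.1f%%" % (ratio * 100))
    for tag, before, after in changes:
        print("%s: %r -> %r" % (tag, before, after))
```

Note that SequenceMatcher aligns the two sequences in order, so a paragraph that was moved shows up as a deletion in one place and an insertion in another, and the ratio drops accordingly. Handling re-ordered sections would need an extra step on top of this, e.g. matching paragraphs against each other first.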
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611