Comparing two book chapters (text files)
crebert at ucsd.edu
Wed Feb 4 20:41:41 EST 2009
On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke <matzke at berkeley.edu> wrote:
> Hi all,
> So I have an interesting challenge. I want to compare two book chapters,
> which I have in plain text format, and find out (a) percentage similarity
> and (b) what has changed.
> Some features make this problem different than what seems to be the standard
> text-matching problem solvable with e.g. difflib. Here is what I mean:
> * there is no guarantee that single lines from each file will be directly
> comparable -- e.g., if a few words are inserted into a sentence, then a
> chunk of the sentence will be moved to the next line, then a chunk of that
> line moved to the next, etc.
> * Also, there are cases where paragraphs have been moved around, sections
> re-ordered, etc. So it can't just be a "linear" match.
> I imagine this kind of thing can't be all that hard in the grand scheme of
> things, but I couldn't find an easily applicable solution readily available.
> I have advanced-beginner Python skills but am not quite where I could do
> this kind of thing from scratch without some guidance about the likely
> functions, libraries etc. to use.
> PS: I am going to have to do this for multiple book chapters so various
> software packages, e.g. for Windows, are not really usable.
Though not written in Python, wdiff
(http://www.gnu.org/software/wdiff/wdiff.html) might be a good
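
If staying in Python is preferable, the same word-level idea can be roughed out with the standard library's difflib by feeding it lists of words instead of lines, so re-wrapped lines stop mattering. What follows is only a sketch under assumptions, not a tested solution for real chapters: the function names are made up for illustration, paragraphs are assumed to be separated by blank lines, and the 0.6 cutoff is an arbitrary starting point. Moved paragraphs are handled by pairing each old paragraph with its most similar new one regardless of position.

```python
import difflib
import re


def words(text):
    """Split text into words; line breaks and extra spaces are ignored."""
    return text.split()


def word_similarity(old_text, new_text):
    """Return a 0..1 similarity ratio computed over words, not lines."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # items in long sequences (common words such as "the").
    matcher = difflib.SequenceMatcher(
        None, words(old_text), words(new_text), autojunk=False)
    return matcher.ratio()


def word_changes(old_text, new_text):
    """Return (tag, old_words, new_words) for every region that differs."""
    old, new = words(old_text), words(new_text)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [(tag, " ".join(old[i1:i2]), " ".join(new[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def paragraphs(text):
    """Split on blank lines (an assumption) and normalise whitespace."""
    return [" ".join(p.split())
            for p in re.split(r"\n\s*\n", text) if p.strip()]


def best_matches(old_text, new_text, cutoff=0.6):
    """Pair each old paragraph with its most similar new one, in any order.

    Returns (old_paragraph, new_paragraph_or_None, score) tuples; the
    cutoff is an arbitrary threshold below which nothing is paired.
    """
    new_paras = paragraphs(new_text)
    pairs = []
    for old_para in paragraphs(old_text):
        best, best_score = None, 0.0
        for new_para in new_paras:
            score = difflib.SequenceMatcher(
                None, words(old_para), words(new_para),
                autojunk=False).ratio()
            if score > best_score:
                best, best_score = new_para, score
        if best_score < cutoff:
            best = None
        pairs.append((old_para, best, best_score))
    return pairs
```

Multiplying word_similarity() by 100 gives a percentage for (a). For (b), word_changes() can be run on each pair from best_matches(), so edits are reported within a paragraph even after it has moved. The comparison is quadratic in the number of paragraphs, which should be tolerable for a single chapter but is worth keeping in mind.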
Follow the path of the Iguana...