difflib and intelligent file differences
Dave Angel
davea at dejaviewphoto.com
Thu Mar 26 11:32:00 EDT 2009
First comment: have you looked at the standard module difflib? There's
a sample program, diff.py, located in tools\scripts that may already do
what you need. It finds the differences in context and displays
them in a way that's frequently intuitive, showing you what's been
changed and what's been added or removed. For example, if just one
line has been added, it displays a few lines before that one,
then the one line (with a leading +), and then a few lines after it.
There are also switches you can use to get different formatting of the results.
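
As a rough illustration of that kind of output, here is a minimal sketch
using difflib.unified_diff directly rather than the diff.py script. The
sample lines are taken from the second f1/f2 pair in the question below;
the labels "f1" and "f2" and the choice of n=2 context lines are just
assumptions for the example.

```python
import difflib

# Sample contents from the question below: f2 is f1 without its first line.
f1_lines = ["1\n", "2\n", "3\n", "4\n", "5\n"]
f2_lines = ["2\n", "3\n", "4\n", "5\n"]

# n is the number of unchanged context lines shown around each change.
diff = list(difflib.unified_diff(f1_lines, f2_lines,
                                 fromfile="f1", tofile="f2", n=2))

# Removed lines carry a leading "-", added lines a leading "+",
# and context lines a leading space.
print("".join(diff), end="")
```

Only the line "1" is reported as removed; the remaining lines are matched up
even though their line numbers differ between the two files.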
But back to your question, presumably about doing it by hand. My first
question is whether the file's lines are completely independent. For
example, is each line a record in a database, with order irrelevant? If
so, use something like Marco's code. If the files are not fully sorted,
you'll need to do a final pruning at the end, where you delete all
members in common between the two sets.
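
For the independent-records case, the idea can be sketched with Python
sets. This is not Marco's code (which isn't reproduced here), just one
possible approach; the function name unmatched_lines is made up for the
example. Note that sets discard duplicate lines and ordering, so this
only fits when both truly don't matter.

```python
def unmatched_lines(lines1, lines2):
    """Return (only_in_1, only_in_2), treating each line as an independent record."""
    set1 = {line.rstrip("\n") for line in lines1}
    set2 = {line.rstrip("\n") for line in lines2}
    # The "final pruning": drop every member the two sets have in common.
    common = set1 & set2
    return sorted(set1 - common), sorted(set2 - common)


# Sample contents from the first f1/f2 pair in the question below.
f1 = ["1\n", "2\n", "3\n", "4\n", "5\n"]
f2 = ["12\n", "2\n", "3\n", "4\n", "5\n"]

only_f1, only_f2 = unmatched_lines(f1, f2)
print(only_f1)  # records found only in f1
print(only_f2)  # records found only in f2
```

With real files you could pass the open file objects in place of the lists,
since iterating over a file yields its lines.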
If the lines are not independent, then you might want to start with
something like difflib.Differ.
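
A minimal sketch of that starting point, again using the second f1/f2
pair from the question below as assumed sample data. Differ.compare
yields every line with a two-character prefix; the filtering step that
keeps only the "- " and "+ " lines is my own addition to get a
differences-only listing.

```python
import difflib

# Sample contents from the question below: f2 is f1 without its first line.
f1_lines = ["1\n", "2\n", "3\n", "4\n", "5\n"]
f2_lines = ["2\n", "3\n", "4\n", "5\n"]

differ = difflib.Differ()
delta = list(differ.compare(f1_lines, f2_lines))

# "- " marks lines only in the first sequence, "+ " lines only in the second;
# lines starting with "  " are common to both, and "? " lines are hints.
changed = [line for line in delta if line.startswith(("- ", "+ "))]
print("".join(changed), end="")
```

Because Differ works on the sequences in order, it is the better fit when
the position of a line relative to its neighbors matters.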
hayes.tyler at gmail.com wrote:
> Hello All:
>
> I am starting to work on a file comparison script where I have to
> compare the contents of two large files. Originally I thought to just
> sort on a numeric key, and use UNIX's comm to do a line by line
> comparison. However, this would fail, hence my thinking that I really
> should've just used Python from the start. Let me outline the problem.
>
> Imagine two text files, f1 and f2,
>
> f1 is
> 1
> 2
> 3
> 4
> 5
>
> and f2 is
>
> 12
> 2
> 3
> 4
> 5
>
> where each line can be thought of as a record, not a running sentence.
> Okay, this one is easy, in fact, this is just a line by line
> comparison using comm -3 f1 f2. BUT...
> (and this is why I'm thinking of using Python's difflib to work on it)
>
> Now say f1 is
>
> 1
> 2
> 3
> 4
> 5
>
> and f2 is
>
> 2
> 3
> 4
> 5
>
> The only difference of the *contents* is 1, but if you did a line by
> line comparison, all of them would return because of the line
> difference at the beginning. So, what I'm really looking for, is not
> just a line by line comparison, but a file contents comparison.
> Ideally, all I want to generate is a file of lines which would contain
> the differences.
>
> My first thought is to do a sweep, where the first sweep takes one
> line from f1, travels f2, if found, deletes it from a tmp version of
> f2, and then on to the second line, and so on. If not found, it writes
> to a file. At the end, if there are also lines still in f1 that never
> were matched because it was longer, it appends those as well to the
> difference file. At the end, you have a nice summary of the lines
> (i.e., records) which are not found in both files.
>
> Any suggestions where to start?
>
>