Diff of Text

GZ zyzhu2000 at gmail.com
Sat Jun 5 21:32:21 EDT 2010


Hi Lie,

On Jun 5, 2:53 am, Lie Ryan <lie.1... at gmail.com> wrote:
> On 06/05/10 15:43, GZ wrote:
>
> > On Jun 4, 8:37 pm, Lie Ryan <lie.1... at gmail.com> wrote:
> >> On 06/05/10 07:51, GZ wrote:
> >>> No, rsync does not solve my problem.
>
> >>> I want a library that does unix 'diff' like function, i.e. compare two
> >>> strings line by line and output the difference. Python's difflib does
> >>> not work perfectly for me, because the resulting differences are
> >>> pretty big. I would like an algorithm that generates the smallest
> >>> differences.
>
> >> is n=0 not short enough?
>
> >> pprint.pprint(list(difflib.context_diff(s, t, n=0)))
>
> > This still does not do what I want it to do. It only displays the diff
> > results in a different format. I want a different algorithm to
> > generate a smaller diff -- in other words, fewer differences.
>
> No, I meant I was confirming that you already have turned off context
> lines (i.e. the n=0 part), right?
>
> Also, what's the nature of the changes? You might be able to minimize
> difflib's output by using word-based or character-based diff-ing instead
> of traditional line-based diff-ing.
>
> diff output is fairly compressible, so you might want to look at zipping
> the output.

Thanks for your response.

The verbosity of the format is not really my problem, and I only care
about line-by-line comparison for now.

Let me think of a better way to express what I mean by a "smaller
diff." After I diff the two strings, I will have something like this:

  AAA
- BBB
+ CCC
+ DDD
- EEE

It means the first line does not change, the second line is replaced
by the third line, the fourth line is new, and the fifth line is
deleted.

I define the "size" of a diff as the total number of minus and plus
lines in it. In my example above, it is 4 (two minuses and two
pluses). Note that no matter what format we use to represent the
diff, this number is the same.
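
To make the count concrete, here is a rough sketch of how it could be
measured for whatever difflib produces. The name diff_size is made up
for illustration; the only library call used is
difflib.SequenceMatcher.get_opcodes(), and the input lists are my guess
at the two texts behind the example above.

```python
import difflib

def diff_size(old_lines, new_lines):
    """Number of '-' plus '+' lines implied by difflib's line matching.

    diff_size is a hypothetical helper, not part of difflib.
    """
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # (i2 - i1) lines become minuses, (j2 - j1) lines become pluses
            total += (i2 - i1) + (j2 - j1)
    return total

# Assumed inputs reconstructed from the example diff
old = ["AAA", "BBB", "EEE"]
new = ["AAA", "CCC", "DDD"]
print(diff_size(old, new))
```

Since the count comes from the opcodes rather than from formatted
output, it does not depend on the context or unified format.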

Python's difflib does not really minimize this number. It tries to
make this number small, but also tries to yield matches that “look
right” to people at the cost of increasing this number
(http://docs.python.org/library/difflib.html).

What I am looking for is an algo that can really minimize this number.
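
For reference, one standard way to get a provably minimal count for
line-level insert/delete diffs is a longest common subsequence (LCS):
the minimum number of pluses and minuses is
len(old) + len(new) - 2 * len(LCS). Below is a minimal sketch, assuming
both texts fit in memory as lists of lines and that quadratic time and
space are acceptable. minimal_diff is a hypothetical helper, not
something difflib provides.

```python
def minimal_diff(old_lines, new_lines):
    """Return diff lines ('  ', '- ', '+ ' prefixes) with the fewest +/-.

    Hypothetical helper based on a plain LCS table; O(n*m) time and space.
    """
    n, m = len(old_lines), len(new_lines)
    # lcs[i][j] = length of the LCS of old_lines[i:] and new_lines[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old_lines[i] == new_lines[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the top-left corner to emit the edit script
    out = []
    i = j = 0
    while i < n and j < m:
        if old_lines[i] == new_lines[j]:
            out.append("  " + old_lines[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            out.append("- " + old_lines[i])
            i += 1
        else:
            out.append("+ " + new_lines[j])
            j += 1
    out.extend("- " + line for line in old_lines[i:])
    out.extend("+ " + line for line in new_lines[j:])
    return out

def changed(diff_lines):
    """Count the '-' and '+' entries in minimal_diff's output."""
    return sum(1 for line in diff_lines if line[:2] in ("- ", "+ "))

result = minimal_diff(["a", "b", "c", "d"], ["b", "c", "d", "a"])
print("\n".join(result))
print(changed(result))
```

The quadratic table is only practical for modest inputs; faster
algorithms for the same minimal result exist (Myers' O(ND) algorithm
is one), but the idea is the same.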


