difflib optimization -- wow (tim peters?)

Tim Peters tim.one at comcast.net
Mon Sep 16 18:50:04 EDT 2002


[jepler at unpythonic.net]
> wow, what a speedup!   thanks, whoever it was (tim?)

Don't thank me yet -- it's in a semi-broken state now, although you won't
notice that when comparing text files containing text <wink>.  By default,
it's dynamically deciding what constitutes a "junk line" now, instead of
using the "this is junk" regexp argument that nobody bothers to set.  Given
C source input, for example, it's now likely to decide that lines consisting
solely of

		}

are junk.  Trying to synch up on lines that are really just noise costs an
enormous amount of time, so getting a better idea of what *is* likely junk
saves an enormous amount of time.  OTOH, if you have two files consisting
solely of repetitions of "A\n" and "B\n" lines, the current code is going to
decide they're *all* junk lines.  That's the sense in which it's currently
broken.
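
If you want to see that for yourself, here's a sketch.  As I recall the
cutoffs in the new code, a line counts as "popular" (hence junk) if it
shows up in more than 1% of a sequence at least 200 lines long -- don't
hold me to the exact numbers:

    import difflib

    # Pathological case:  both files are nothing but "A\n" and "B\n".
    a = ["A\n", "B\n"] * 200
    b = ["B\n", "A\n"] * 200

    s = difflib.SequenceMatcher(None, a, b)
    # Each line value occurs in far more than 1% of a 400-element
    # sequence, so the popularity heuristic calls them *all* junk, and
    # the only "match" left is the zero-length sentinel at the end:
    print(s.get_matching_blocks())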

>     $ time python2.3 -O ~/pyunidiff.py foo.c foo2.c > /tmp/pat
>
>     real    0m5.457s
>     user    0m4.640s
>     sys     0m0.110s
>     $ time python2.2 -O ~/pyunidiff.py foo.c foo2.c > /tmp/pat
>
>     real    3m14.042s
>     user    3m7.480s
>     sys     0m1.180s
>
> foo.c and foo2.c are ~300k long C files from different releases of a
> piece of software.  the diff produced is nearly 250k.
>
> pyunidiff is at http://unpythonic.net/~jepler/pyunidiff.py

Three attempts to connect to that timed out for me, so I don't know what
pyunidiff is doing.  Since ndiff outputs the full text of both input files,
I wouldn't consider it a good thing if it produced an output smaller than
its inputs (or, if that is a good thing, I'm on to a great new compression
scheme <wink>).
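
To make the "full text" point concrete:  ndiff tags and emits every line
of both inputs (common lines once, with a "  " prefix), plus "?" guide
lines for intraline differences, so its output can't come out smaller
than either input.  A tiny made-up example:

    import difflib, sys

    a = "one\ntwo\nthree\n".splitlines(1)   # 1 == keep line endings
    b = "ore\ntree\nemu\n".splitlines(1)
    sys.stdout.writelines(difflib.ndiff(a, b))

which prints

    - one
    ?  ^
    + ore
    ?  ^
    - two
    - three
    ?  -
    + tree
    + emu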

> tim, I only need another factor of 30x and it'll beat diff(1) -u in wall
> time (or 5x to beat diff(1) -u -d).  It already gives files about 5%
> shorter on the files I've tried (or on par with diff -u -d).

What is "it", and 5% shorter than what?

> I notice that the diff produced from 2.2 is marginally smaller, another
> 1.5% or so.

See above.  You *seem* to be saying that 2.3's output is both 5% shorter and
about 1.5% longer than 2.2's output.

> The price to pay for the optimization?

The difflib algorithm couldn't care less how long the diffs it produces are.
It's aiming at human comprehensibility, not at minimality.  Tell me that the
new output is 1.5% less intuitive than 2.2's and then I'd care -- provided
you could convince me you had an objective way to measure intuitiveness to
two significant digits <wink>.
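
For the record:  the matcher builds its result by finding the longest
contiguous run of matching non-junk lines, then recursing on the pieces
to the left and right of it.  Shortest-possible-edit-script never enters
into it.  You can watch the blocks it settles on via get_opcodes() -- a
sketch, nothing deep:

    import difflib

    a = "private Thread currentThread;"
    b = "private volatile Thread currentThread;"
    s = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        # Each opcode says how to turn a[i1:i2] into b[j1:j2].
        print("%7s a[%d:%d] b[%d:%d]" % (tag, i1, i2, j1, j2))

which reports one "insert" (for " volatile") bracketed by two "equal"
blocks -- readable structure first, minimality nowhere.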

It would have been a pointless exercise to recode diff's algorithm in
Python.

if-you-want-diff-you-know-where-to-find-it-ly y'rs  - tim
