difflib

Sun Apr 22 08:27:56 EDT 2001

Mikkel Rasmussen wrote in message ...
>Has anybody got any references for the algorithm used in difflib. The
>documentation says:
>
>"The basic algorithm predates, and is a little fancier than, an algorithm
>published in the late 1980's by Ratcliff and Obershelp under the hyperbolic
>name ``gestalt pattern matching.'' The idea is to find the longest
>contiguous matching subsequence that contains no ``junk'' elements (the
>Ratcliff and Obershelp algorithm doesn't address junk). ...

I'm also surprised.  I haven't heard of that alignment method
before, as compared to sequence alignment methods for bioinformatics.
But then I haven't really delved into that topic with sequences.

Still, I'm surprised to see the comments:

> This does not yield minimal edit sequences, but does tend to yield
> matches that "look right" to people.

> Timing:  Basic R-O is cubic time worst case and quadratic time expected
> case.  SequenceMatcher is quadratic time for the worst case and has
> expected-case behavior dependent in a complicated way on how many
> elements the sequences have in common; best case time is linear.

I'm used to methods like Needleman and Wunsch which are worst case
quadradic to get the optimal gapped alignment, with best case linear
time.

But Uncle Tim knows this:
http://mail.python.org/pipermail/python-dev/2001-January/011603.html

Perhaps some of the biopython work along these lines might be
relevant to the general population?  Though it really boils down
to what kind of alignment you want - longest match, fewest insertion
and deletions, etc.  Hmm, and it seems difflib has support for
masking what is sometimes viewed as low information content regions.

For funsies, here's another link:
http://www.csse.monash.edu.au/~lloyd/tildeStrings/Compress/1998TR98-14-align
.html

                    Andrew
                    dalke at acm.org