difflib

Wed Apr 25 20:53:42 EDT 2001

[Tim]
>> That's because you're trying to outguess nature, but SequenceMatcher is
>> trying to outguess people:  the latter doesn't care about the most
>> efficient way to change X into Y, it wants to guess what a *person*
>> probably did when changing X into Y by hand.

[Andrew Dalke]
> I think you might be surprised.

Not a chance <wink>.

> There are "optimal" sequence alignment programs but people will
> modify the results by hand to make it look like what they expect.
> The problem is that there is no true optimal solution, ending up
> with sequence alignment programs which a lot of tunable parameters
> and with humans modifying the results.

Not surprised yet ...

WRT your comments on my examples, yes, because "the usual" run of sequence
alignment algorithms *does* have "a lot of tunable parameters", I'd be
surprised if you *couldn't* usually get the same effect on specific examples
via fiddling enough free parameters.  ndiff has no free parameters, though,
apart from a user-definable notion of what's uninteresting as a synch point.
A more fundamental difference is that ndiff has an extremely strong notion of
locality, which you'll *think* of as it effectively setting the cost for
matching on junk to +infinity -- but that's just a way of trying to force it
into a mold it didn't come from.  It's really not trying to minimize a global
cost function (although I could dervive a global cost function it *does*
minimize, that would be quite a strain).

> ...
> Biological sequence alignment programs don't like breaking up
> regions into lots of parts because it's more likely to destroy a
> coding region, or break an important structure.  So they introduce
> gaps and penalities associated with different types of gaps (creating
> a gap, increasing the size of a gap, adding to the beginning or end).
> These usually turn out to make this more like what people expect.

I could repeat that you're trying to outguess nature and ndiff isn't, but
somehow I don't think that would help <wink>.

> And you haven't seen the *hours* people will spend to align sequences
> by hand.  I know at least one case where someone became a coauthor
> on a paper because the hand aligment he did proved very insightful.
> (This was for a multiple sequence alignment which is an NP problem
> and can't be solved optimally for more than about a dozen sequences.)

I've seen the hours people spend aligning sequences by hand in speech
recognition work, and at several levels (like aligning hypothesized phoneme
sequences to signal features, then aligning hypothesized word sequences to
phoneme sequences, etc).  People learn a lot from doing that by hand,
although doing it by hand hasn't been very practical in products <wink>.
nidff's algorithms isn't suitable for any phase of that either, btw.

> ...
> We've got nothing that could be used for this any time soon, and
> anything we're likely to do will be optimized for aligning letters,
> not words or lines.  I was just surprised not to have heard of the
> alignment method before.

You shouldn't be!  Last I looked, there were over 100 distinct algorithms in
the literature just for finding out where one string starts in another
string.  Many people have heard of Boyer-Moore and Knuth-Morris-Pratt, but
they're just a drop in that bucket.  Keeping up with *everything* that's out
there is simply impossible.  BTW, most people never heard of the sort
algorithm Python uses either.

the-history-bins-overflow-with-discarded-diamonds-ly y'rs  - tim