difflib

Sun Apr 22 21:43:48 EDT 2001

Tim:
>That's because you're trying to outguess nature, but SequenceMatcher is
>trying to outguess people:  the latter doesn't care about the most
efficient
>way to change X into Y, it wants to guess what a *person* probably did when
>changing X into Y by hand.

I think you might be surprised.  There are "optimal" sequence
alignment programs but people will modify the results by hand to
make it look like what they expect.  The problem is that there is
no true optimal solution, ending up with sequence alignment programs
which a lot of tunable parameters and with humans modifying the
results.

>    before:  private Thread currentThread;
>    after:   private volatile Thread currentThread;
>
>Run this thru your usual sequence comparator, and it's likely to say "Ah, I
>know!  They inserted 'e volatil' between the 't' and 'e' at the end of
>'private'".

If I did that, I would say that spaces define gaps so there is much
less cost to adding characters at those sites.  Then the alignment
code would say "add 'volatile ' after 'private '" which is as expected.

>    before:  ab
>    after:   acab
>
>"Ah, I know!  They stuffed 'ca' into the middle!" (something Unix diff and
>Windows Windiff both believe, albeit via entirely different algorithms).

Adding to the beginning or end of a sequence should be a lower
penalty than adding in the middle.  Most good bio alignments should
handle this case as expected.

>"more efficient" ways to get from before to after.  AFAIK, it's only people
>using vi who spend minutes plotting out the most efficient key sequence to
>get a job done <wink>.

Biological sequence alignment programs don't like breaking up
regions into lots of parts because it's more likely to destroy a
coding region, or break an important structure.  So they introduce
gaps and penalities associated with different types of gaps (creating
a gap, increasing the size of a gap, adding to the beginning or end).
These usually turn out to make this more like what people expect.

And you haven't seen the *hours* people will spend to align sequences
by hand.  I know at least one case where someone became a coauthor
on a paper because the hand aligment he did proved very insightful.
(This was for a multiple sequence alignment which is an NP problem
and can't be solved optimally for more than about a dozen sequences.)

>I don't know, but I named it "difflib.py" instead of "sequencematcher.py"
>just in case one of you intrepid researchers did have something of general
>value.

Hey - working code is good.  We've got nothing that could be used
for this any time soon, and anything we're likely to do will be
optimized for aligning letters, not words or lines.  I was just
surprised not to have heard of the alignment method before.

I will keep difflib in mind as we work on this problem.

>> Hmm, and it seems difflib has support for masking what is sometimes
>> viewed as low information content regions.
>
>No, Wootton has support for what difflib views as clumps of junk.  I'm not
a
>fan of overblown terminology unless *so* overblown it's funny <wink>.

Wink noted,  but I feel like being elaborative today.  <wink>

"Junk" has another meaning in DNA sequences.  Junk DNA is noncoding
regions (they don't produce protein) but some junk DNA appears
conserved, perhaps because it was created after a recent duplication
event or because it has a impact on transcription elsewhere.

So we need to use another phrase besides "junk" for what you call
"junk" (people want to align junk DNA).  "Low information content"
works.  BTW, the phrase isn't even exact - purely random sequences
contain a lot of information in the entropy sense, but don't code
for anything real, so biologists ignore it because it contains
no information to them.

                    Andrew
                    dalke at acm.org