[Tutor] Need advise on protein sequence alignment code in python 2.7 giving wrong output

Alan Gauld alan.gauld at btinternet.com
Tue Aug 11 13:42:15 CEST 2015

```On 11/08/15 09:53, Белякова Анастасия wrote:
> I am trying to develop code to perform global sequence alignment in python
> 2.7 for proteins with gap penalty of -5 and blossom64 scoring matrix. Code
> I have now give wrong output.

This list is for folks learning the Python language and
standard library. Given that you must assume that most
of us have no idea what you are talking about. We are
not bio-scientists.

So you will need to give uis a bit more explanation about what's
going on. What is blossom64 scoring? How does it work?
Whats a gap penalty? and so on.

There may also be more appropriate fora where you will find
people who know your area? Perhaps in the SciPy community?

Given all of that I can only make a few comments below.

> When I give 'PLEASANTLY' and 'MEANLY' as input, I get 'PLEASANTLY' and
> 'M-EAN--L-Y' as output instead of 'PLEASANTLY' and '-MEA--N-LY'. Blossom62
> scoring matrix is given as dictionary {'A': {'A': 4, 'C': 0, etc. -5 is
> insertion/deletion penalty.

Again that means little to me.

> def scoring(v,w,blossom62):
>     S = [*(len(w)+1) for _ in xrange(len(v)+1)]
>     for i in range(0,len(w)+1):
>         S[i]=i*-5

I think this could be clearer as

for i in range(len(S)):
S[i] = i * -5

>     for j in range (0,len(v)+1):
>         S[j]=j*-5

for i,row in enumerate(S):
row = i * -5

>     for i in xrange(len(v)):
>         for j in xrange(len(w)):
>               S[i+1][j+1] =
> max(S[i+1][j]-5,S[i][j+1]-5,S[i][j]+blossom62[v[i]][w[j]] )

I'll have to assume that's correct but I'd probably
separate the max() terms to make them easier to see:

max(S[i+1][j]-5,
S[i][j+1]-5,
S[i][j]+blossom62[v[i]][w[j]] )

>     backtrack=[*(len(w)) for _ in xrange(len(v))]
>     for i in xrange(len(v)):
>        for j in xrange(len(w)):
>           if max(S[i][j-1],S[i-1][j],S[i-1][j-1]) == S[i-1][j]:
>               backtrack[i][j]= 0
>           elif max(S[i][j-1],S[i-1][j],S[i-1][j-1]) == S[i][j-1]:
>               backtrack[i][j]= 1
>           elif max(S[i][j-1],S[i-1][j],S[i-1][j-1]) == S[i-1][j-1]:
>               backtrack[i][j]= 2

Again I'd separate the max terms - even if only by adding a space.

if max(S[i][j-1], S[i-1][j], S[i-1][j-1]) == S[i-1][j]:

Readability helps a lot in debugging.
In fact I'd be tempted to define a function max_index()
to return the max index of the input tuple. It would
improve readability and you can test it independently.

Something like:

def max_index(*t):
return tuple(t).index(max(t))

for i in xrange(len(v)):
for j in xrange(len(w)):
backtrack[i][j]= max_index(...)

which expresses the logic more clearly and avoids
the potential error implicit in duplicating the
terms for each test.

>    def insert_indel(word,pos):
>        return word[:pos]+'-'+word[pos:]

Defining a function in the middle of your program logic is just
confusing. If it must be an inner function move it to the top.
Better still move it outside to module scope. As it is, it just
interrupts the flow of your code.

>     v_aligned, w_aligned = v, w
>     i,j =len(v)-1,len(w)-1
>     while i*j!=0:
>        if backtrack[i][j]==0:
>            i-=1
>            w_aligned=insert_indel(w_aligned,j)
>        elif backtrack[i][j]==1:
>            j-=1
>            v_aligned=insert_indel(v_aligned,i)
>        elif backtrack[i][j]==2:
>            j-=1
>            i-=1

Again I need to assume that's correct.

>     print v_aligned
>     print w_aligned

increased clarity may make debugging easier.

You could also try splitting the last while loop out into a
separate function. You can then test it in isolation, which
may help identify where the error lies.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

```