[Tutor] need help with comparing list of sequences in Python!!

Mon Aug 30 09:23:27 CEST 2004

> Hi,
> would really appreciate it if someone could help me in Python as I am
> new to the language.

Well, you've asked for lots of different things, but I'll try to answer
the most basic bits first; we can come back to the complicated bits
later:

> The first sequence is always the reference sequence which I am trying
> to investigate. Basically, to reach the objective, I need to compare
> each sequence with the first one, starting with the comparison of the
> reference sequence with itself.

Here is some pseudo code (i.e. I haven't tested it):

def compareSequence(seq1, seq2):
    return seq1 == seq2  # dummy value for now

data = open('protein.dat')
results = []
reference = data.readline().strip()  # first line is the reference
test = reference[:]  # take a copy
results.append(compareSequence(reference, test))
for line in data:
    # strip the trailing newline so it doesn't count as a mismatch
    results.append(compareSequence(reference, line.strip()))
data.close()

for result in results:
    print(result)

That should give you the basic structure.

> The objective of the program is to manipulate each sequence, i.e.
> randomly change characters and calculate the distance (Distance: number
> of letters between a pair of sequences that don't match DIVIDED by the
> length of the shortest sequence) between the sequence in question and
> the reference sequence.

This is the harder bit. This is what will go inside the
compareSequence() function we defined above. We might even want to
break it down into two other functions, rearrangeSequence() and
calculateDistance(). However, I'm not clear from your description
exactly what the algorithm is. How many times do you rearrange the
letters within a given test? Can you show us two sequences and how to
calculate what you want manually?
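
In the meantime, here is a minimal sketch of calculateDistance() based
on the definition quoted above. It is only one reading of it: I'm
assuming the letters are compared position by position, and that any
letters beyond the end of the shorter sequence are simply ignored.

```python
def calculateDistance(seq1, seq2):
    """Mismatches in the overlapping part, divided by the shorter length."""
    shortest = min(len(seq1), len(seq2))
    if shortest == 0:
        raise ValueError("cannot compare an empty sequence")
    # zip() stops at the end of the shorter sequence (assumed behaviour)
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches / float(shortest)

print(calculateDistance("ARNDC", "ARNDC"))    # identical: 0.0
print(calculateDistance("ARNDC", "ARQDCEG"))  # 1 mismatch in 5: 0.2
```

If the extra letters in the longer sequence should count as mismatches
too, the result would change, so check this against a pair you have
worked out by hand.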

> (the letters that are used for this are: A R N D C E Q G H I L K M F
> P S T W Y V)

I'm assuming these are the letters already in the data, or are you
suggesting that we randomly add new letters rather than rearranging
the existing ones? In either case the random module will probably
help us out here.
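
To show what the random module offers for each reading, here is a rough
sketch. rearrangeSequence() shuffles the letters already present, while
substituteLetter() (a name I've made up for illustration) swaps one
position for a different letter from the 20 listed. Which of the two
you actually need depends on the answer to the question above.

```python
import random

AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"  # the 20 letters from the question

def rearrangeSequence(seq, rng=random):
    """Reading 1: shuffle the letters already in the sequence."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def substituteLetter(seq, position, rng=random):
    """Reading 2: replace one position with a different letter."""
    choices = [c for c in AMINO_ACIDS if c != seq[position]]
    return seq[:position] + rng.choice(choices) + seq[position + 1:]

print(rearrangeSequence("ARNDC"))    # same letters, random order
print(substituteLetter("ARNDC", 2))  # the N becomes some other letter
```

The optional rng argument lets you pass a seeded random.Random(...)
instance when you want repeatable results for testing.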

> Randomization is done using different P values
> e.g. (P = probability of change)
> if P = 0      no random change has been done
> if P = 1.0    all the letters in that particular sequence have been
> randomly changed, therefore P = 1.0 equals the length of the sequence

So where do we find this P value? Or is it a case of changing the
sequence until they match and calculating P as we go? In which case
the compareSequence() function should return P as part of the result?

> So it's calculating the distance each time between two sequences (the
> first is always the reference sequence and the second is another
> sequence) at each P value (starting from 0, then 0.1, 0.2, ... 1.0).

Or do you mean here that we repeat the compareSequence() test for
each P value (incrementing by 0.1)?

If so modify the loop above to do:

for line in data:
    for P in range(11):  # gives P values 0.0, 0.1, ... 1.0
        results.append(compareSequence(reference, line.strip(), P/10.0))

and add the P parameter to the function definition.
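
As a rough illustration of how the pieces might fit together, here is
a sketch of compareSequence() with the P parameter added. It rests on
guesses that you will need to confirm: that P means "change
round(P * length) randomly chosen positions" (my reading of "P = 1.0
equals the length of the sequence"), that a changed letter is always
replaced by a different one from the 20 listed, and that it is the test
sequence rather than the reference that gets changed. The mutate()
helper is a name I've invented for this example.

```python
import random

AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"

def mutate(seq, p, rng=random):
    """Change round(p * len(seq)) random positions (assumed meaning of P)."""
    letters = list(seq)
    count = int(round(p * len(letters)))
    for i in rng.sample(range(len(letters)), count):
        # pick a replacement that differs from the current letter
        letters[i] = rng.choice([c for c in AMINO_ACIDS if c != letters[i]])
    return "".join(letters)

def compareSequence(reference, test, p, rng=random):
    """Mutate the test sequence, then return its distance to the reference."""
    changed = mutate(test, p, rng)
    shortest = min(len(reference), len(changed))  # assumes non-empty input
    mismatches = sum(1 for a, b in zip(reference, changed) if a != b)
    return mismatches / float(shortest)

# Compare the reference with itself at P = 0.0, 0.1, ... 1.0
reference = "ARNDCEQGHI"
for step in range(11):
    p = step / 10.0
    print("%.1f %.2f" % (p, compareSequence(reference, reference, p)))
```

With the reference compared against itself the distance simply tracks
P, which makes a handy sanity check before you run it on real data.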

Hopefully that gets you started. If you try it, tell us how you got
on, and if you give us more clarity on the algorithm you should be
well on the way.

Alan G
Author of the Learn to Program web tutor
http://www.freenetpages.co.uk/hp/alan.gauld/tutor2/