[Tutor] need help with comparing list of sequences in Python!!

Mon Aug 30 19:53:19 CEST 2004

Fuzzi,

How do you count mismatches if the lengths of the sequences are different? 
Do you start from the front of both sequences or do you look for a best 
match? Do you count the extra characters in the longer string as mismatches 
or do you ignore them? An example or two would help.

For example if
s1=ABCD
s2=XABDDYY
how many characters do you count as different?
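
To make the question concrete, here is a rough sketch of two possible answers. It assumes the comparison starts at the front of both strings, with no attempt at finding a best match. The name count_mismatches and the count_extra flag are made up for illustration.

```python
def count_mismatches(s1, s2, count_extra=False):
    """Compare s1 and s2 position by position, starting at the front."""
    # zip() stops at the end of the shorter string
    mismatches = sum(1 for a, b in zip(s1, s2) if a != b)
    if count_extra:
        # Optionally treat the unmatched tail of the longer string as mismatches
        mismatches += abs(len(s1) - len(s2))
    return mismatches


print(count_mismatches("ABCD", "XABDDYY"))
print(count_mismatches("ABCD", "XABDDYY", count_extra=True))
```

Under these assumptions the first call counts 3 differences and the second counts 6. A best-match comparison (sliding one string along the other) would give a smaller count.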

Kent

At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>Hi,
>I would really appreciate it if someone could help me with Python, as I am
>new to the language.
>
>I have a list of protein sequences in a text file, e.g. (dummy data)
>
>MVEIGEKAPEIELVDTDLKKVKIPSDFKGKVVVLAFYPAAFTSVCTKEMCTFRDSMAKFNEVNAVVIGISVDP
>PFS
>
>MAPITVGDVVPDGTISFFDENDQLQTVSVHSIAAGKKVILFGVPGAFTPTCSMSHVPGFIGKAEELKSKG
>
>APIKVGDAIPAVEVFEGEPGNKVNLAELFKGKKGVLFGVPGAFTPGCSKTHLPGFVEQAEALKAKGVQVVACL
>SVND
>
>HGFRFKLVSDEKGEIGMKYGVVRGEGSNLAAERVTFIIDREGNIRAILRNI
>
>etc etc
>
>They are not always of the same length.
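
Reading the sequences could look something like the sketch below. It assumes the file separates sequences with blank lines and that a long sequence may be wrapped over several lines; if your file has one sequence per line, it is even simpler. The function name parse_sequences and the short sample text are invented for illustration.

```python
def parse_sequences(text):
    """Split text into sequences, one per blank-line-separated block."""
    sequences = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Wrapped lines of the same sequence are collected together
            current.append(line)
        elif current:
            # A blank line ends the current sequence
            sequences.append("".join(current))
            current = []
    if current:
        sequences.append("".join(current))
    return sequences


# Shortened dummy data; for a real file, pass open(filename).read() instead
sample = "MVEIG\nPFS\n\nMAPIT\n\nHGFRF\n"
sequences = parse_sequences(sample)
reference, others = sequences[0], sequences[1:]
print(reference, others)
```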
>
>The first sequence is always the reference sequence, which I am trying to
>investigate. To reach the objective, I need to compare each
>sequence with the first one, starting with the comparison of the
>reference sequence with itself.
>
>The objective of the program is to manipulate each sequence, i.e. randomly
>change characters, and calculate the distance between the sequence in
>question and the reference sequence (Distance: number of letters between a
>pair of sequences that don't match DIVIDED by the length of the shortest
>sequence). So I need program code that takes the first sequence as the
>reference sequence (a constant, which is at the top of the list). First it
>compares it with itself, then with the second sequence, then with the
>third sequence, etc., one at a time.
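
The distance as defined above might be written like this. It is only a sketch: it assumes the two sequences are compared from the front, position by position, and that any extra letters in the longer one are ignored. The function name distance is my own choice.

```python
def distance(reference, other):
    """Mismatching positions divided by the length of the shorter sequence."""
    shortest = min(len(reference), len(other))
    if shortest == 0:
        raise ValueError("cannot compare an empty sequence")
    # zip() pairs letters up only as far as the shorter sequence goes
    mismatches = sum(1 for a, b in zip(reference, other) if a != b)
    return mismatches / shortest


reference = "MVEIGEKAPE"  # shortened dummy data
print(distance(reference, reference))
print(distance(reference, "MAPITVGD"))
```

Comparing the reference with an unchanged copy of itself gives 0.0, which is a handy sanity check.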
>
>For the first comparison, you take a copy of the reference sequence and
>manipulate the copied sequence, i.e. randomly change the letters in the
>sequence, and calculate the distance between them.
>(The letters that are used for this are: A R N D C E Q G H I L K M F P S T
>W Y V)
>
>The reference sequence is never altered or manipulated. For the first
>comparison, it is the copied version of the reference sequence that is altered.
>
>Randomization is done using different P values
>(P = probability of change), for example:
>if P = 0      no random change has been made
>if P = 1.0    all the letters in that particular sequence have been randomly
>changed, so at P = 1.0 the number of changes equals the length of the sequence
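
One way to do the random change is sketched below. It assumes P is applied to each letter independently, and that a changed letter is always replaced by a different one, so that P = 1.0 really alters every position. If you instead want exactly P times the length letters changed, you would pick that many positions with random.sample. The names mutate and AMINO_ACIDS are invented for this example.

```python
import random

AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def mutate(sequence, p, rng=random):
    """Return a copy of sequence with each letter changed with probability p."""
    result = []
    for letter in sequence:
        # rng.random() is in [0, 1), so p = 0 never changes and p = 1 always does
        if rng.random() < p:
            # Assumption: the replacement must differ from the original letter
            choices = [aa for aa in AMINO_ACIDS if aa != letter]
            result.append(rng.choice(choices))
        else:
            result.append(letter)
    return "".join(result)


reference = "MVEIGEKAPE"  # shortened dummy data
print(mutate(reference, 0.0))
print(mutate(reference, 0.5))
print(mutate(reference, 1.0))
```

Because strings are immutable, mutate always builds a new string and the reference itself is never altered.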
>
>So it is calculating the distance each time between two sequences (the
>first is always the reference sequence, the second is another sequence) at
>each P value (starting from 0, then 0.1, 0.2, ....... 1.0).
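
Putting it together, the loop over sequences and P values could be sketched as follows. The assumptions here are mine: P is applied per letter, a changed letter always becomes a different one, and the comparison runs from the front over the length of the shorter sequence. All function names are made up for illustration.

```python
import random

AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def mutate(sequence, p, rng):
    """Return a copy of sequence with each letter changed with probability p."""
    return "".join(
        rng.choice([aa for aa in AMINO_ACIDS if aa != letter])
        if rng.random() < p else letter
        for letter in sequence
    )


def distance(reference, other):
    """Mismatching positions divided by the length of the shorter sequence."""
    shortest = min(len(reference), len(other))
    mismatches = sum(1 for a, b in zip(reference, other) if a != b)
    return mismatches / shortest


def distance_table(sequences, rng=None):
    """For each sequence, list (P, distance to the reference) for P = 0.0 .. 1.0."""
    rng = rng or random.Random()
    reference = sequences[0]  # never modified; mutate() works on copies
    p_values = [i / 10 for i in range(11)]  # avoids adding 0.1 repeatedly
    table = []
    for original in sequences:  # the first pass compares the reference with itself
        row = [(p, distance(reference, mutate(original, p, rng))) for p in p_values]
        table.append(row)
    return table


sequences = ["MVEIGEKAPE", "MAPITVGD", "APIKVGDAIPAV"]  # shortened dummy data
table = distance_table(sequences, random.Random(0))
for index, row in enumerate(table):
    print(index, ["%.2f" % d for p, d in row])
```

The first row is the reference against its own mutated copy, so it should start at 0.00 for P = 0 and end at 1.00 for P = 1.0.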
>
>Note: The number of sequences to be compared could be any number, and they
>could be of any length.
>
>I don't know how to compare each sequence with the first sequence, or how
>to randomize the characters in a sequence in order to calculate
>the distance for each pair of sequences. If someone can give me any
>guidance, I would be grateful.
>
>Cheers
>Fuzzi
>