[Tutor] need help with comparing list of sequences in Python!!

Kent Johnson kent_johnson at skillsoft.com
Mon Aug 30 19:53:19 CEST 2004


How do you count mismatches if the lengths of the sequences are different? 
Do you start from the front of both sequences or do you look for a best 
match? Do you count the extra characters in the longer string as mismatches 
or do you ignore them? An example or two would help.

For example if
how many characters do you count as different?
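
Until those questions are answered, here is a minimal sketch of one possible reading: compare position by position from the front, ignore the extra characters in the longer sequence, and divide by the length of the shorter one. The function name `distance` and the choice to ignore the overhang are assumptions for illustration, not something the original post settles.

```python
def distance(ref, seq):
    """Fraction of mismatched positions, measured over the shorter length.

    Assumption: both sequences are aligned at their first character and
    the overhang of the longer sequence is ignored.
    """
    shortest = min(len(ref), len(seq))
    if shortest == 0:
        raise ValueError("cannot compare an empty sequence")
    # zip() stops at the end of the shorter sequence, so the extra
    # characters of the longer one never take part in the comparison.
    mismatches = sum(1 for a, b in zip(ref, seq) if a != b)
    return mismatches / shortest


if __name__ == "__main__":
    print(distance("ARND", "ARND"))    # identical sequences
    print(distance("ARNDCE", "ARQD"))  # one mismatch in four compared positions
```

If the extra characters should count as mismatches instead, or if a best match is wanted, the function would need to change accordingly.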


At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>I would really appreciate it if someone could help me with Python, as I am
>new to the language.
>I have a list of protein sequences in a text file, e.g. (dummy data)
>etc etc
>They are not always of the same length.
>The first sequence is always the reference sequence, which I am trying to
>investigate. To reach the objective, I need to compare each sequence with
>the first one, starting with the comparison of the reference sequence
>with itself.
>The objective of the program is to manipulate each sequence, i.e. randomly
>change characters, and calculate the distance between the sequence in
>question and the reference sequence. (Distance: number of letters between
>a pair of sequences that don't match DIVIDED by the length of the shortest
>sequence.) So I need program code that takes the first sequence (the
>constant one at the top of the list) as the reference sequence. First it
>compares it with itself, then with the second sequence, then with the
>third sequence, etc., one at a time.
>For the first comparison, you take a copy of the reference sequence and
>manipulate the copy, i.e. randomly change the letters in the sequence, and
>calculate the distance between them.
>(the letters that are used for this are: A R N D C E Q G H I L K M F P S T 
>W Y V)
>The reference sequence is never altered or manipulated; for the first
>comparison, it's the copied version of the reference sequence that's altered.
>Randomization is done using different P values
>(P = probability of change), e.g.
>if P = 0      no random change has been made
>if P = 1.0   all the letters in that particular sequence have been randomly
>changed, so at P = 1.0 the number of changes equals the length of the sequence
>So it's calculating the distance each time between two sequences (the first
>is always the reference sequence, the second is another sequence) at each
>P value (starting from 0, then 0.1, 0.2, ....... 1.0).
>Note: the number of sequences to be compared could be any number, and they
>could be of any length.
>I don't know how to compare each sequence with the first sequence, or how
>to randomize the characters in a sequence, in order to calculate the
>distance for each pair of sequences. If someone can give me any guidance,
>I would be grateful.
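
As a starting point, the randomization and the loop over P values could be sketched roughly as below. Several things here are assumptions rather than anything stated in the post: the names `mutate`, `distance` and `sweep` are made up for illustration; P is read as the fraction of positions to change (so P = 1.0 changes every letter); a changed letter is always replaced by a different one; and sequences are compared from the front, ignoring the overhang of the longer one.

```python
import random

# The 20 letters listed in the original message.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def mutate(seq, p, rng=random):
    """Return a copy of seq with round(p * len(seq)) positions changed.

    Assumption: each chosen position gets a *different* letter, so
    p = 1.0 leaves no position unchanged.
    """
    n_changes = round(p * len(seq))
    # sample() picks distinct positions, so no position is changed twice.
    positions = rng.sample(range(len(seq)), n_changes)
    letters = list(seq)  # work on a copy; the input is never modified
    for i in positions:
        choices = [c for c in AMINO_ACIDS if c != letters[i]]
        letters[i] = rng.choice(choices)
    return "".join(letters)


def distance(ref, seq):
    """Mismatches from the front, divided by the shorter length."""
    shortest = min(len(ref), len(seq))
    return sum(a != b for a, b in zip(ref, seq)) / shortest


def sweep(sequences, steps=10):
    """Compare a mutated copy of every sequence with the first one.

    Returns (sequence, p, distance) tuples for p = 0, 1/steps, ..., 1.0.
    The reference (sequences[0]) itself is never altered.
    """
    ref = sequences[0]
    results = []
    for seq in sequences:
        for k in range(steps + 1):
            p = k / steps
            results.append((seq, p, distance(ref, mutate(seq, p))))
    return results


if __name__ == "__main__":
    for seq, p, d in sweep(["ARNDCEQGHI", "ARNDCEQG"]):
        print(seq, p, d)
```

Reading the sequences from the text file is left out, since the file layout is not shown. If P is meant as an independent per-letter probability rather than a fraction of positions, `mutate` would test `rng.random() < p` at each position instead of sampling a fixed number of them.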
