[Tutor] need help with comparing list of sequences in Python!!
Kent Johnson
kent_johnson at skillsoft.com
Mon Aug 30 19:53:19 CEST 2004
Fuzzi,
How do you count mismatches if the lengths of the sequences are different?
Do you start from the front of both sequences or do you look for a best
match? Do you count the extra characters in the longer string as mismatches
or do you ignore them? An example or two would help.
For example, if
s1=ABCD
s2=XABDDYY
how many characters do you count as different?
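To make the question concrete, here is a minimal sketch of one possible reading: compare the two strings position by position from the front, and either ignore or count the leftover characters of the longer one. The names count_mismatches, distance and count_extra are made up for illustration, and the "divide by the shortest length" rule is taken from your description below; a best-match (alignment) search would give different numbers.

```python
def count_mismatches(s1, s2, count_extra=False):
    """Count positions where s1 and s2 differ, aligned from the front."""
    # zip() stops at the end of the shorter sequence, so only the
    # overlapping positions are compared here.
    mismatches = sum(1 for a, b in zip(s1, s2) if a != b)
    if count_extra:
        # Optional convention: every unpaired trailing character of the
        # longer sequence also counts as a mismatch.
        mismatches += abs(len(s1) - len(s2))
    return mismatches


def distance(s1, s2, count_extra=False):
    """Mismatches divided by the length of the shorter sequence."""
    shortest = min(len(s1), len(s2))
    if shortest == 0:
        raise ValueError("cannot compare an empty sequence")
    return count_mismatches(s1, s2, count_extra) / shortest


s1 = "ABCD"
s2 = "XABDDYY"
print(count_mismatches(s1, s2))        # 3 (A/X, B/A, C/B differ; D/D match)
print(count_mismatches(s1, s2, True))  # 6 (the 3 above plus 3 extra letters)
print(distance(s1, s2))                # 0.75
```

Which of these numbers is the one you want decides how the rest of the program should be written.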
Kent
At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>Hi,
>I would really appreciate it if someone could help me with Python, as I am
>new to the language.
>
>I have a list of protein sequences in a text file, e.g. (dummy data):
>
>MVEIGEKAPEIELVDTDLKKVKIPSDFKGKVVVLAFYPAAFTSVCTKEMCTFRDSMAKFNEVNAVVIGISVDP
>PFS
>
>MAPITVGDVVPDGTISFFDENDQLQTVSVHSIAAGKKVILFGVPGAFTPTCSMSHVPGFIGKAEELKSKG
>
>APIKVGDAIPAVEVFEGEPGNKVNLAELFKGKKGVLFGVPGAFTPGCSKTHLPGFVEQAEALKAKGVQVVACL
>SVND
>
>HGFRFKLVSDEKGEIGMKYGVVRGEGSNLAAERVTFIIDREGNIRAILRNI
>
>etc etc
>
>They are not always of the same length.
>
>The first sequence is always the reference sequence, which is the one I am
>trying to investigate. To reach the objective, I need to compare each
>sequence with the first one, starting with the comparison of the
>reference sequence with itself.
>
>The objective of the program is to manipulate each sequence, i.e. randomly
>change characters, and calculate the distance between the sequence in
>question and the reference sequence. (Distance: the number of letters that
>do not match between a pair of sequences, DIVIDED by the length of the
>shortest sequence.) So I need program code that takes the first sequence
>(always at the top of the list) as a constant reference. First it compares
>the reference with itself, then with the second sequence, then with the
>third sequence, and so on, one at a time.
>
>For the first comparison, you take a copy of the reference sequence and
>manipulate the copy, i.e. randomly change the letters in the copied
>sequence, and calculate the distance between the copy and the reference.
>(The letters used for this are: A R N D C E Q G H I L K M F P S T
>W Y V)
>
>The reference sequence is never altered or manipulated; for the first
>comparison, it is the copied version of the reference sequence that is altered.
>
>Randomization is done using different P values
>(P = probability of change), for example:
>if P = 0, no random change has been made
>if P = 1.0, all the letters in that particular sequence have been randomly
>changed, so at P = 1.0 the number of changes equals the length of the sequence
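One way to read this randomization, sketched under assumptions: each letter is replaced independently with probability P, and a replacement is always a different letter, so that P = 1.0 really changes every position. The names mutate and AMINO_ACIDS are illustrative only. If P is instead meant as the exact fraction of positions to change, random.sample over the positions would be the closer fit.

```python
import random

# The 20 letters listed in the message above.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def mutate(seq, p):
    """Return a new string where each letter is replaced with probability p.

    Assumption: a replaced letter is always different from the original,
    so p = 1.0 changes every position and p = 0 changes none.
    """
    out = []
    for ch in seq:
        # random.random() returns a float in [0.0, 1.0)
        if random.random() < p:
            out.append(random.choice([a for a in AMINO_ACIDS if a != ch]))
        else:
            out.append(ch)
    return "".join(out)


reference = "HGFRFKLVSDEKGEIGMKYG"
copy_p0 = mutate(reference, 0.0)   # identical to the reference
copy_p1 = mutate(reference, 1.0)   # differs at every position
print(copy_p0 == reference)
print(sum(a != b for a, b in zip(reference, copy_p1)), "of", len(reference))
```

Because Python strings are immutable, mutate always builds a new string, so the reference sequence itself is never altered.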
>
>So it calculates the distance each time between two sequences (the first is
>always the reference sequence, the second is another sequence) at each P value
>(starting from 0, then 0.1, 0.2, ... 1.0).
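The whole procedure could be sketched roughly as follows. This assumes sequences are compared position by position from the front with extra letters ignored, and that each letter is replaced with probability P by a different letter; both are guesses, not something the description settles. The function names and the shortened sample sequences are made up for illustration.

```python
import random

AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def distance(s1, s2):
    """Mismatching positions (from the front) / length of the shorter one."""
    shortest = min(len(s1), len(s2))
    return sum(a != b for a, b in zip(s1, s2)) / shortest


def mutate(seq, p):
    """Replace each letter with a different one with probability p."""
    return "".join(
        random.choice([a for a in AMINO_ACIDS if a != ch])
        if random.random() < p else ch
        for ch in seq
    )


def sweep(sequences, steps=10):
    """Compare a mutated copy of every sequence against the first one.

    Returns a list of (sequence index, P, distance) tuples. Index 0 is
    the reference compared with a mutated copy of itself.
    """
    reference = sequences[0]          # never modified
    results = []
    for index, seq in enumerate(sequences):
        for step in range(steps + 1):
            p = step / steps          # 0.0, 0.1, ..., 1.0
            altered = mutate(seq, p)
            results.append((index, p, distance(reference, altered)))
    return results


# Shortened made-up sequences; the first one is the reference.
sequences = ["HGFRFKLVSD", "HGFKFKLV", "APIKVGDAIP"]
for index, p, d in sweep(sequences):
    print(f"sequence {index}  P={p:.1f}  distance={d:.3f}")
```

Computing P as step / steps rather than adding 0.1 repeatedly avoids floating-point drift, so the last value is exactly 1.0.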
>
>Note: the number of sequences to be compared could be anything, and they
>can be of any length.
>
>I don't know how to compare each sequence with the first sequence, or how to
>randomize the characters in a sequence, in order to calculate
>the distance for each pair of sequences. If someone can give me any
>guidance, I would be grateful.
>
>Cheers
>Fuzzi
>
