[Tutor] need help with comparing list of sequences in Python!!
Kent Johnson
kent_johnson at skillsoft.com
Tue Aug 31 13:04:09 CEST 2004
Fuzzi,
Here is one way to do this:
- Use zip() to pair up elements from the two sequences
>>> s1='aaabbbbcccc'
>>> s2='aaaccccbcccccccccc'
>>> zip(s1, s2)
[('a', 'a'), ('a', 'a'), ('a', 'a'), ('b', 'c'), ('b', 'c'), ('b', 'c'),
('b', 'c'), ('c', 'b'), ('c', 'c'), ('c', 'c'), ('c', 'c')]
- Use a list comprehension to compare the elements of the pair and put the
results in a new list. I'm not sure if you want to count the matches or the
mismatches - your original post says mismatches, but in your example you
count matches. This example counts matches but it is easy to change.
>>> [a == b for a, b in zip(s1, s2)]
[True, True, True, False, False, False, False, False, True, True, True]
- In Python, True has a value of 1 and False has a value of 0, so adding up
the elements of this list gives the number of matches:
>>> sum([a == b for a, b in zip(s1, s2)])
6
- min() and len() give you the length of the shorter sequence:
>>> min(len(s1), len(s2))
11
- When you divide, you have to convert one of the numbers to a float or
Python will use integer division!
>>> 6/11
0
>>> float(6)/11
0.54545454545454541
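
The steps above can be combined into one small function. This is only a
sketch: the name similarity() is made up for illustration, and it is written
for Python 3, where / always does true division, so the float() conversion
is not needed there. The check for an empty sequence is an added assumption.

```python
def similarity(s1, s2):
    """Fraction of matching positions, relative to the shorter sequence."""
    # zip() stops at the end of the shorter sequence, so the extra
    # characters of the longer one are ignored.
    matches = sum(a == b for a, b in zip(s1, s2))
    shorter = min(len(s1), len(s2))
    if shorter == 0:
        # Assumption: comparing against an empty sequence is an error.
        raise ValueError("cannot compare an empty sequence")
    return matches / shorter


s1 = 'aaabbbbcccc'
s2 = 'aaaccccbcccccccccc'
print(similarity(s1, s2))  # 6 matches out of 11 positions
```

To count mismatches instead, change a == b to a != b.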
Put this together with the framework that Alan gave you to create a program
that calculates distances. Then you can start on the randomization part.
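
For the randomization part, here is one possible sketch, based on my reading
of the original post quoted below: each letter of a copy is replaced, with
probability P, by a letter from the 20-letter alphabet. The names
AMINO_ACIDS, distance() and mutate() are hypothetical, the rule that a
changed letter must differ from the original is an assumption, and the code
is written for Python 3.

```python
import random

# The 20 letters listed in the original post.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def distance(s1, s2):
    """Mismatches divided by the length of the shorter sequence."""
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / min(len(s1), len(s2))


def mutate(seq, p):
    """Return a copy of seq where each letter is changed with probability p."""
    result = []
    for ch in seq:
        # random.random() is in [0.0, 1.0), so p = 0 never changes a
        # letter and p = 1.0 always does.
        if random.random() < p:
            # Assumption: a "change" always produces a different letter.
            ch = random.choice([c for c in AMINO_ACIDS if c != ch])
        result.append(ch)
    return "".join(result)


reference = "MVEIGEKAPEIELVDTDLKK"  # shortened dummy data
for step in range(11):
    p = step / 10  # 0.0, 0.1, ..., 1.0
    print(p, distance(reference, mutate(reference, p)))
```

The reference string itself is never modified, because mutate() builds and
returns a new string.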
Kent
At 04:03 AM 8/31/2004 +0100, Fathima Javeed wrote:
>Hi Kent
>
>To answer your question:
>well here is how it works
>sequence one = aaabbbbcccc
>length = 11
>
>seq 2 = aaaccccbcccccccccc
>length = 18
>
>to get the pairwise similarity score, the program compares the letters
>of the two sequences up to length = 11, the length of the shorter sequence.
>
>so a match gets a score of 1, therefore using + for match and x for mismatch
>
>aaabbbbcccc
>aaaccccbcccccccccc
>+++xxxxx+++
>
>therefore the score = 6/11 = 0.5454 or 54%
>
>so you only score the first 11 letters of each sequence, and it is not
>required to compare the rest of sequence 2. this is what the
>distance matrix is doing
>
>match score == 6
>
>The spaces are deleted to make both of them the same length
>
>
>>From: Kent Johnson <kent_johnson at skillsoft.com>
>>To: "Fathima Javeed" <fathimajaveed at hotmail.com>, tutor at python.org
>>Subject: Re: [Tutor] need help with comparing list of sequences in
>>Python!!
>>Date: Mon, 30 Aug 2004 13:53:19 -0400
>>
>>Fuzzi,
>>
>>How do you count mismatches if the lengths of the sequences are
>>different? Do you start from the front of both sequences or do you look
>>for a best match? Do you count the extra characters in the longer string
>>as mismatches or do you ignore them? An example or two would help.
>>
>>For example if
>>s1=ABCD
>>s2=XABDDYY
>>how many characters do you count as different?
>>
>>Kent
>>
>>At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>>>Hi,
>>>I would really appreciate it if someone could help me in Python, as I am
>>>new to the language.
>>>
>>>Well, I have a list of protein sequences in a text file, e.g. (dummy data)
>>>
>>>MVEIGEKAPEIELVDTDLKKVKIPSDFKGKVVVLAFYPAAFTSVCTKEMCTFRDSMAKFNEVNAVVIGISVDP
>>>PFS
>>>
>>>MAPITVGDVVPDGTISFFDENDQLQTVSVHSIAAGKKVILFGVPGAFTPTCSMSHVPGFIGKAEELKSKG
>>>
>>>APIKVGDAIPAVEVFEGEPGNKVNLAELFKGKKGVLFGVPGAFTPGCSKTHLPGFVEQAEALKAKGVQVVACL
>>>SVND
>>>
>>>HGFRFKLVSDEKGEIGMKYGVVRGEGSNLAAERVTFIIDREGNIRAILRNI
>>>
>>>etc etc
>>>
>>>They are not always of the same length,
>>>
>>>The first sequence is always the reference sequence which I am trying to
>>>investigate. Basically, to reach the objective, I need to compare each
>>>sequence with the first one, starting with the comparison of the
>>>reference sequence with itself.
>>>
>>>The objective of the program is to manipulate each sequence, i.e.
>>>randomly change characters, and calculate the distance (Distance: number
>>>of letters between a pair of sequences that don't match DIVIDED by the
>>>length of the shortest sequence) between the sequence in question
>>>and the reference sequence. So I need program code that
>>>takes the first sequence as the reference sequence (a constant which is
>>>at the top of the list); first it compares it with itself, then it compares it
>>>with the second sequence, then with the third sequence, etc., one at a time.
>>>
>>>For the first comparison, you take a copy of the reference sequence and
>>>manipulate the copied sequence, i.e. randomly change the letters in
>>>the sequence, and calculate the distance between them.
>>>(The letters that are used for this are: A R N D C E Q G H I L K M F P S
>>>T W Y V)
>>>
>>>The reference sequence is never altered or manipulated; for the first
>>>comparison, it's the copied version of the reference sequence that's altered.
>>>
>>>Randomization is done using different P values
>>>(P = probability of change), for example:
>>>if P = 0, no random change has been done
>>>if P = 1.0, all the letters in that particular sequence have been
>>>randomly changed, so at P = 1.0 the number of changes equals the length of the sequence
>>>
>>>So it's calculating the distance each time between two sequences (the first
>>>is always the reference sequence, the second is another sequence) at each P
>>>value (starting from 0, then 0.1, 0.2, ....... 1.0).
>>>
>>>Note: The number of sequences to be compared could be any number, and they
>>>could be of any length
>>>
>>>I don't know how to compare each sequence with the first sequence, or how
>>>to randomize the characters in a sequence in order to
>>>calculate the distance for each pair of sequences. If someone can give me
>>>any guidance, I would be grateful.
>>>
>>>Cheers
>>>Fuzzi