[Tutor] need help with comparing list of sequences in Python!!

Kent Johnson kent_johnson at skillsoft.com
Tue Aug 31 13:04:09 CEST 2004


Here is one way to do this:
- Use zip() to pair up elements from the two sequences
 >>> s1='aaabbbbcccc'
 >>> s2='aaaccccbcccccccccc'
 >>> zip(s1, s2)
[('a', 'a'), ('a', 'a'), ('a', 'a'), ('b', 'c'), ('b', 'c'), ('b', 'c'), 
('b', 'c'), ('c', 'b'), ('c', 'c'), ('c', 'c'), ('c', 'c')]

- Use a list comprehension to compare the elements of the pair and put the 
results in a new list. I'm not sure if you want to count the matches or the 
mismatches - your original post says mismatches, but in your example you 
count matches. This example counts matches but it is easy to change.
 >>> [a == b for a, b in zip(s1, s2)]
[True, True, True, False, False, False, False, False, True, True, True]

- In Python, True has a value of 1 and False has a value of 0, so adding up 
the elements of this list gives the number of matches:
 >>> sum([a == b for a, b in zip(s1, s2)])
6

- min() and len() give you the length of the shortest sequence:
 >>> min(len(s1), len(s2))
11

- When you divide, you have to convert one of the numbers to a float or 
Python will use integer division!
 >>> 6/11
0
 >>> float(6)/11
0.54545454545454541
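
For reference, the steps above can be combined into one function. This is only a sketch, and it is written for Python 3, where zip() returns an iterator rather than a list and / is already true division (so the float() conversion is not needed there; // is the integer-division operator). The names match_fraction and mismatch_fraction are made up for this example.

```python
def match_fraction(s1, s2):
    """Matches divided by the length of the shorter sequence."""
    n = min(len(s1), len(s2))
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    # zip() stops at the end of the shorter sequence, so the extra
    # characters of the longer one are ignored.
    matches = sum(a == b for a, b in zip(s1, s2))
    return matches / n  # true division in Python 3


def mismatch_fraction(s1, s2):
    """Mismatches divided by the length of the shorter sequence."""
    n = min(len(s1), len(s2))
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    return sum(a != b for a, b in zip(s1, s2)) / n
```

With the two strings from the session above, match_fraction(s1, s2) gives 6/11 and mismatch_fraction(s1, s2) gives 5/11.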

Put this together with the framework that Alan gave you to create a program 
that calculates distances. Then you can start on the randomization part.
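
One possible shape for the randomization part is sketched below in Python 3. It rests on assumptions that are not confirmed by the original post: each letter is changed independently with probability P, and a changed letter is always replaced by a different letter from the 20 listed. The post could also be read as "change exactly P times the length letters", which would need a different approach (for example picking positions with random.sample). The names randomize, mismatch_fraction and distances_by_p are hypothetical.

```python
import random

# The 20 letters listed in the original post.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def mismatch_fraction(s1, s2):
    """Mismatches divided by the length of the shorter sequence."""
    n = min(len(s1), len(s2))
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    return sum(a != b for a, b in zip(s1, s2)) / n


def randomize(seq, p, rng=random):
    """Return a copy of seq; each letter is replaced with probability p.

    Assumption: a replaced letter always differs from the original one.
    """
    out = []
    for ch in seq:
        if rng.random() < p:
            out.append(rng.choice([c for c in AMINO_ACIDS if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def distances_by_p(reference, seq, rng=random):
    """Distance from reference to a randomized copy of seq, for P = 0.0 .. 1.0."""
    results = []
    for step in range(11):
        p = step / 10
        altered = randomize(seq, p, rng)  # the reference itself is never altered
        results.append((p, mismatch_fraction(reference, altered)))
    return results
```

Calling distances_by_p(reference, reference) covers the first comparison (the reference against an altered copy of itself); calling it with each of the other sequences in turn covers the rest.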


At 04:03 AM 8/31/2004 +0100, Fathima Javeed wrote:
>Hi Kent
>To answer your question:
>well here is how it works
>sequence one = aaabbbbcccc
>length = 11
>seq 2 = aaaccccbcccccccccc
>length = 18
>To get the pairwise similarity score, the program compares the letters
>of the two sequences up to length = 11, the length of the shorter sequence.
>A match gets a score of 1, so using + for match and x for mismatch,
>the score is therefore 6/11 = 0.5454, or about 54%.
>So you only score the first 11 letters of each sequence and it is not
>required to compare the rest of sequence 2. This is what the
>distance matrix is doing.
>match score == 6
>The spaces are deleted to make both of them the same length
>>From: Kent Johnson <kent_johnson at skillsoft.com>
>>To: "Fathima Javeed" <fathimajaveed at hotmail.com>, tutor at python.org
>>Subject: Re: [Tutor] need help with comparing list of sequences in
>>Date: Mon, 30 Aug 2004 13:53:19 -0400
>>How do you count mismatches if the lengths of the sequences are 
>>different? Do you start from the front of both sequences or do you look 
>>for a best match? Do you count the extra characters in the longer string 
>>as mismatches or do you ignore them? An example or two would help.
>>For example if
>>how many characters do you count as different?
>>At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>>>I would really appreciate it if someone could help me in Python as I am 
>>>new to the language.
>>>I have a list of protein sequences in a text file, e.g. (dummy data)
>>>etc etc
>>>They are not always of the same length,
>>>The first sequence is always the reference sequence which I am trying to 
>>>investigate. Basically, to reach the objective, I need to compare each 
>>>sequence with the first one, starting with the comparison of the 
>>>reference sequence with itself.
>>>The objective of the program is to manipulate each sequence, i.e. 
>>>randomly change characters, and calculate the distance (Distance: number 
>>>of letters between a pair of sequences that don't match DIVIDED by the 
>>>length of the shortest sequence) between the sequence in question 
>>>and the reference sequence. So I need program code that 
>>>takes the first sequence as the reference sequence (a constant, which is 
>>>on top of the list); first it compares it with itself, then it compares it 
>>>with the second sequence, then with the third sequence etc., one at a time.
>>>For the first comparison, you take a copy of the ref sequence and 
>>>manipulate the copied sequence, i.e. randomly change the letters in 
>>>the sequence, and calculate the distance between them.
>>>(the letters that are used for this are: A R N D C E Q G H I L K M F P S 
>>>T W Y V)
>>>The reference sequence is never altered or manipulated; for the first 
>>>comparison, it's the copied version of the reference sequence that's altered.
>>>Randomization is done using different P values
>>>(P = probability of change), for example:
>>>if P = 0      no random change has been done
>>>if P = 1.0   all the letters in that particular sequence have been 
>>>randomly changed, so P = 1.0 corresponds to the length of the sequence
>>>So it's calculating the distance each time between two sequences (the first 
>>>is always the reference sequence, the second is another sequence) at each P 
>>>value (starting from 0, then 0.1, 0.2, ....... 1.0).
>>>Note: Number of sequences to be compared could be any number and of any 
>>>I don't know how to compare each sequence with the first sequence and how 
>>>to do randomization of the characters in the sequence in order to 
>>>calculate the distance for each pair of sequences. If someone can give me 
>>>any guidance, I would be grateful.