recombination variations

Tue Nov 30 06:06:45 EST 2004

The problem I'm solving is to take a sequence like 'ATSGS' and make all 
the DNA sequences it represents.  The A, T, and G are fine but the S 
represents C or G.  I want to take this input:

[ [ 'A' ] , [ 'T' ] , [ 'C' , 'G' ], [ 'G' ] , [ 'C' , 'G' ] ]

and make the list:

[ 'ATCGC' , 'ATCGG' , 'ATGGC' , 'ATGGG' ]

The code below is what I have so far.  'alphabet' is a dictionary that 
designates the set of base pairs that each letter represents (for 
example, for S above it gives C and G).  I call these ambiguous base 
pairs because they could be more than one.  Thus the function name 
'unambiguate':  it makes a list of sequences with only A, T, C, and G and 
none of the ambiguous base pair designations.

	The function 'unambiguate_bp' takes a sequence and the position of a 
base pair in it and returns a list of sequences with that base pair 
replaced by each of its unambiguous possibilities.

	The function 'unambiguate_seq' takes a sequence and runs unambiguate_bp 
on each position in the sequence.  Each time it handles a position it 
replaces the list of sequences it's working on with the combined output 
from unambiguate_bp.  It's a bit confusing.  I'd like it to be clearer.

Is there a better way to do this?
--
David Siedband
generation-xml.com

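def unambiguate_bp(seq, bp):
    # Expand the letter at index bp into each base it can stand for.
    seq_set = []
    for i in alphabet[seq[bp]]:
        seq_set.append(seq[:bp] + i + seq[bp+1:])
    return seq_set

def unambiguate_seq(seq):
    # Start with the raw sequence and expand one position at a time.
    result = [seq]
    for i in range(len(seq)):
        result_tmp = []
        for j in result:
            result_tmp = result_tmp + unambiguate_bp(j, i)
        # Every sequence in result is now unambiguous up to position i.
        result = result_tmp
    return result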

alphabet = {
    'A' : ['A'],
    'T' : ['T'],
    'C' : ['C'],
    'G' : ['G'],
    'W' : ['A','T'],
    'M' : ['A','C'],
    'R' : ['A','G'],
    'Y' : ['T','C'],
    'K' : ['T','G'],
    'S' : ['C','G'],
    'H' : ['A','T','C'],
    'D' : ['A','T','G'],
    'V' : ['A','G','C'],
    'B' : ['C','T','G'],
    'N' : ['A','T','C','G']
    }
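
One possible simplification, sketched here rather than offered as the 
definitive answer:  the expansion described above is a Cartesian product 
of the per-position choices, so it could be written with 
itertools.product from the standard library (assuming a Python version 
that provides it).  The names ALPHABET and expand are hypothetical, 
chosen only for this illustration;  ALPHABET is assumed to hold the same 
letters, in the same order, as the 'alphabet' table above.

```python
from itertools import product

# Assumed copy of the 'alphabet' table, written as strings for brevity.
ALPHABET = {
    'A': 'A', 'T': 'T', 'C': 'C', 'G': 'G',
    'W': 'AT', 'M': 'AC', 'R': 'AG',
    'Y': 'TC', 'K': 'TG', 'S': 'CG',
    'H': 'ATC', 'D': 'ATG', 'V': 'AGC', 'B': 'CTG',
    'N': 'ATCG',
}

def expand(seq):
    """Return every unambiguous sequence that seq can represent."""
    # One group of candidate letters per position, e.g. 'ATS' -> ['A', 'T', 'CG'].
    choices = [ALPHABET[letter] for letter in seq]
    # product() yields one tuple per combination, varying the last position fastest.
    return [''.join(combo) for combo in product(*choices)]

print(expand('ATSGS'))  # ['ATCGC', 'ATCGG', 'ATGGC', 'ATGGG']
```

This gives the sequences in the same order as unambiguate_seq, and the 
number of results is the product of the number of choices at each 
position, so a sequence with many N's grows quickly.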