[Tutor] how to do systematic searching in dictionary and printing it

Kent Johnson kent37 at tds.net
Thu Oct 20 19:56:24 CEST 2005


I would do this by making a dictionary mapping sequence to header for each data set. Then make a set that contains the keys (the sequences) common to both dictionaries. Finally, use the dictionaries again to look up the headers for each common sequence.

a = '''>a1
TTAATTGGAACA
>a2
AGGACAAGGATA
>a3
TTAAGGAACAAA'''.split()

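# Make a dict mapping sequence to header for the 'a' data set
ak = a[1::2]  # sequences (odd-indexed tokens) become the keys
av = a[::2]   # headers (even-indexed tokens) become the values
a_dict = dict(zip(ak, av))
print(a_dict)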

b = '''>b1
TTAATTGGAACA
>b2
AGGTCAAGGATA
>b3
AAGGCCAATTAA'''.split()

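# Make a dict mapping sequence to header for the 'b' data set
bk = b[1::2]  # sequences (odd-indexed tokens) become the keys
bv = b[::2]   # headers (even-indexed tokens) become the values
b_dict = dict(zip(bk, bv))
print(b_dict)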

# Make a set that contains the keys common to both dicts
common_keys = set(a_dict).intersection(b_dict)
print(common_keys)

# For each common key, print the corresponding headers
for common in common_keys:
    print('%s\t%s' % (a_dict[common], b_dict[common]))
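
For anyone who wants the same three steps as reusable functions, here is a minimal sketch written for Python 3. The function names seq_to_header and common_headers are made up for illustration and do not come from the thread. It assumes each record is one header line followed by a single-line sequence, and that headers contain no whitespace, as in the sample data. As with the dictionaries above, if a sequence occurs more than once in one data set, only the last header for it is kept.

```python
# Python 3 sketch; the function names are illustrative, not from the thread.

def seq_to_header(fasta_text):
    """Map each sequence to its header.

    Assumes a header token (starting with '>') is always followed by a
    single-line sequence and that headers contain no whitespace.
    Blank lines are ignored because split() drops all whitespace.
    """
    tokens = fasta_text.split()
    headers = tokens[::2]      # even-indexed tokens
    sequences = tokens[1::2]   # odd-indexed tokens
    return dict(zip(sequences, headers))


def common_headers(a_text, b_text):
    """Return (a_header, b_header) pairs for sequences present in both inputs."""
    a_dict = seq_to_header(a_text)
    b_dict = seq_to_header(b_text)
    # In Python 3, dict key views support set operations directly.
    common = a_dict.keys() & b_dict.keys()
    # Sort the common sequences so the output order is reproducible.
    return [(a_dict[seq], b_dict[seq]) for seq in sorted(common)]


a = """>a1
TTAATTGGAACA
>a2
AGGACAAGGATA
>a3
TTAAGGAACAAA"""

b = """>b1
TTAATTGGAACA
>b2
AGGTCAAGGATA
>b3
AAGGCCAATTAA"""

for a_header, b_header in common_headers(a, b):
    print('%s\t%s' % (a_header, b_header))
```

With the sample data this prints a single line, >a1 and >b1 separated by a tab. To work from the files in the original question, the text could be read with open('a1.txt').read() and passed in the same way.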
 

Kent

Srinivas Iyyer wrote:
> Dear group,
> 
> I have two files in text format that look like this:
> 
> 
> File a1.txt:
> 
> >a1
> TTAATTGGAACA
> >a2
> AGGACAAGGATA
> >a3
> TTAAGGAACAAA
> 
> File b1.txt:
> 
> >b1
> TTAATTGGAACA
> >b2
> AGGTCAAGGATA
> >b3
> AAGGCCAATTAA
> 
> 
> I want to check if there are common elements based on
> ATGC sequences. a1 and b1 are identical sequences and
> I want to select them and print the headers (starting
> with > symbol). 
> 
> a1 '\t' b1
> 
> 
> 
> Here:
> 
> >XXXXX is called the header, and the line that follows it
> is the sequence. In bioinformatics, this is called FASTA
> format.  What I am doing here is matching the
> sequences (these are always 25 mers in this instance),
> and if they match, I am asking Python to write the
> header +'\t'+ header
> 
> 
> ak = a[1::2]
> av = a[::2]
> seq_dict = dict(zip(ak,av))
> 
> **************************************
> 
> >>> seq_dict
> {'TTAAGGAACAAA': '>a3', 'AGGACAAGGATA': '>a2',
> 'TTAATTGGAACA': '>a1'}
> **************************************
> 
> 
> 
> bv = b[1::2]  
> 
> ***************************************
> 
> >>> bv
> ['TTAATTGGAACA', 'AGGTCAAGGATA', 'AAGGCCAATTAA']
> 
> >>> for i in bv:
> ...     if seq_dict.has_key(i):
> ...         print seq_dict[i]
> ...
> >a1
> 
> 
> ***************************************
> 
> Here a1 is the only common element.
> 
> However, I am having difficulty printing that b1 is
> identical to a1.
> 
> How do I take b and do this search? It was easy for me
> to take the sequence part by doing b[1::2]. However, I
> want to print that the b1 header has the same sequence
> as a1:
> 
> a1 +'\t'+b1
> 
> Is there any way I can do this? This is very simple, but
> due to my brain block I am unable to work it out.
> Can anyone please help me out?
> 
> Thanks
> 
> 
> 
> 	
> 		
> 
> 


