[Tutor] looping problem

Kent Johnson kent37 at tds.net
Sat Sep 23 22:24:23 CEST 2006


kumar s wrote:
> hi, 
> 
> thank you. this is not a homework question. 
> 
> I have a very huge file of fasta sequence.
> 
> I want to create a dictionary where 'GeneName' as key
> and sequence of ATGC characters as value 
> 
> 
> biglist = dat.split('\t')
> ['GeneName xxxxxxxx','yyyyyyyy','ATTAAGGCCAA'.......]
> 
> Now I want to select ''GeneName xxxxxxxx' into listA
> and 'ATTAAGGCCAA' into listB
> 
> so I want to select 0,3,6,9 elements into listA
> and 2,5,8,11 and so on elements into listB
> 
> then I can do dict(zip(listA,listB))
>
> however, the very loops concept is getting blanked out
> in my brain when I want to do this:
> 
> for j in range(len(biglist)):
>         from here .. I cannot think anything..
> 
> may be it is just mental block.. thats the reason I
> seek help on forum. 

Lloyd has pointed you to slicing as the answer to your immediate 
question. For the larger problem of reading FASTA files, however, you 
might want to look at CoreBio, a new library of Python modules for 
computational biology that looks pretty good:
http://code.google.com/p/corebio/
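
For reference, here is a minimal sketch of the slicing approach to the 
immediate question. It assumes the fields really do come in groups of 
three, as in your example; the sample string dat is made up for 
illustration.

```python
# Made-up sample data: name, middle field, sequence, repeated.
dat = "GeneName aaa\tyyy1\tATTAAGGCCAA\tGeneName bbb\tyyy2\tGGCCTTAA"

biglist = dat.split('\t')

# Every third element starting at index 0 (0, 3, 6, ...) is a name.
listA = biglist[0::3]
# Every third element starting at index 2 (2, 5, 8, ...) is a sequence.
listB = biglist[2::3]

# Pair names with sequences and build the dict.
genes = dict(zip(listA, listB))
```

No explicit loop over range(len(biglist)) is needed; the step in the 
slice does the selecting.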

CoreBio has built-in support for reading FASTA files into Seq objects. 
For example:

In [1]: import corebio.seq_io

In [2]: f=open(r'F:\Bio\BIOE48~1\KENTJO~1\SEQUEN~2\fasta\GI5082~1.FAS')

In [3]: seqs = corebio.seq_io.read(f)

seqs is now a list of Seq objects, one for each sequence in the original 
file. In this case there is only one sequence, but it will work for your 
file too.

In [4]: for seq in seqs:
    ...:     print seq.name
    ...:     print seq
    ...:
    ...:
gi|50826|emb|CAA28242.1|
MIRTLLLSALVAGALSCGYPTYEVEDDVSRVVGGQEATPNTWPWQVSLQVLSSGRWRHNCGGSLVANNWVLTAAHCLSNYQTYRVLLGAHSLSNPGAGSAAVQVSKLVVHQRWNSQNVGNGYDIALIKLASPVTLSKNIQTACLPPAGTI
LPRNYVCYVTGWGLLQTNGNSPDTLRQGRLLVVDYATCSSASWWGSSVKSSMVCAGGDGVTSSCNGDSGGPLNCRASNGQWQVHGIVSFGSSLGCNYPRKPSVFTRVSNYIDWINSVMARN

In your case, you want a dict whose keys are the sequence names up to the 
first tab and whose values are the actual sequences. Something like this 
should work:
d = dict( (seq.name.split('\t')[0], seq) for seq in seqs)

The Seq class is a string subclass, so putting the seq objects directly 
in the dict as values gives you what you want.
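
To illustrate why a string subclass works as a dict value, here is a 
small sketch. NamedSeq is a hypothetical stand-in that I made up for this 
example; it is not CoreBio's Seq class and does not reproduce its 
constructor. It only assumes an object that is a string and also carries 
a name attribute.

```python
class NamedSeq(str):
    """Hypothetical stand-in: a str that also carries a name."""

    def __new__(cls, data, name=None):
        # str is immutable, so the value is set in __new__, not __init__.
        obj = str.__new__(cls, data)
        obj.name = name
        return obj


# Made-up record whose name contains a tab, as in the original question.
seqs = [NamedSeq("ATTAAGGCCAA", name="GeneName xxxxxxxx\tyyyyyyyy")]

# Same dict construction as above: key is the name up to the first tab.
d = dict((seq.name.split('\t')[0], seq) for seq in seqs)

value = d["GeneName xxxxxxxx"]
```

The value compares equal to a plain string and supports the usual string 
methods, while still keeping its name.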

There is also an iterator that reads sequences one at a time. This might 
be a little faster and more memory-efficient because it doesn't have to 
create the big list of all sequences. Something like this (untested):

from corebio.seq_io.fasta_io import iterseq
f = open(...)
d = dict( (seq.name.split('\t')[0], seq) for seq in iterseq(f))
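
If you would rather not depend on a library, the same 
one-record-at-a-time idea can be sketched with a plain generator. This is 
not CoreBio's code; iter_fasta and the sample data are names I made up, 
and it assumes the usual FASTA layout where each record starts with a 
'>' header line followed by one or more sequence lines.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None and line:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.strip())
    # Emit the last record in the file.
    if name is not None:
        yield name, "".join(chunks)


# Made-up two-record sample standing in for an open file.
sample = io.StringIO(
    ">GeneName aaa\tyyy1\nATTAAG\nGCCAA\n>GeneName bbb\tyyy2\nGGCCTTAA\n"
)

d = dict((name.split("\t")[0], seq) for name, seq in iter_fasta(sample))
```

Only one record is held in memory at a time while the dict is being 
built, which is the point of using an iterator here.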

Kent


