[Baypiggies] reading files quickly and efficiently

Glen Jarvis glen at glenjarvis.com
Wed Nov 17 22:13:00 CET 2010


Biopython will also do all of this for you:

>>> from Bio import SeqIO

>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|',
name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar
Microtus ... sequence',
dbxrefs=[])


You can also look at particular fields (record.id, record.description, and
record.seq).
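
One caveat for a multi-record file like nr: SeqIO.read() expects exactly one
record and raises ValueError otherwise, so SeqIO.parse(), which returns an
iterator, is the better fit. Here is a rough sketch of a keyword filter on the
description field. The names matching_records and filter_fasta are made up for
illustration, and the case-insensitive substring match is an assumption about
what "keywords" should mean here.

```python
def matching_records(records, keywords):
    """Lazily yield records whose description mentions any keyword."""
    wanted = [k.lower() for k in keywords]
    for record in records:
        description = record.description.lower()
        if any(k in description for k in wanted):
            yield record


def filter_fasta(in_path, out_path, keywords):
    """Stream in_path, write matching records to out_path, return the count."""
    from Bio import SeqIO  # requires Biopython to be installed

    # SeqIO.parse returns an iterator, so only one record is in memory at a time.
    records = SeqIO.parse(in_path, "fasta")
    with open(out_path, "w") as out:
        return SeqIO.write(matching_records(records, keywords), out, "fasta")
```

Something like filter_fasta("nr", "hits.fasta", ["ribosomal"]) would then write
the matching records to hits.fasta (a hypothetical output name) without loading
the whole 6.5 GB file.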


Look at this tutorial:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16
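
If installing Biopython is not an option, the same streaming idea can be
approximated with the standard library alone. This is only a minimal sketch:
iter_fasta and count_matches are hypothetical names, and it assumes every
record starts with a ">" header line, as in plain FASTA.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


def count_matches(path, keyword):
    """Count records whose header contains keyword, one line at a time."""
    with open(path) as handle:
        return sum(1 for header, _ in iter_fasta(handle) if keyword in header)
```

Because the file object is iterated directly, memory use stays bounded by the
largest single record rather than by the size of the file.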


Cheers,


Glen


On Wed, Nov 17, 2010 at 12:54 PM, Tung Wai Yip <tungwaiyip at yahoo.com> wrote:

> readlines() will read the entire file into memory. Use f directly as an
> iterator:
>
> # not tested!
>
> count = 0
> with open('nr') as f:
>     # Iterating over f reads one line at a time instead of the whole file
>     for line in f:
>         count += 1
> print(count)
>
> Wai Yip
>
>
>
>> I need to work on a file whose size is around 6.5 GB. This file consists of
>> protein header information and then the corresponding protein sequence.
>> Here are a few sample lines of this file:
>>
>> -----------
>>
>> >gi|15674171|ref|NP_268346.1| 30S ribosomal protein S18 [Lactococcus lactis
>> subsp. lactis Il1403] gi|116513137|ref|YP_812044.1| 30S ribosomal protein
>> S18 [Lactococcus lactis subsp. cremoris SK11]
>> gi|125625229|ref|YP_001033712.1| 30S ribosomal protein S18 [Lactococcus
>> lactis subsp. cremoris MG1363] gi|281492845|ref|YP_003354825.1| 50S
>> ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147]
>> gi|13878750|sp|Q9CDN0.1|RS18_LACLA RecName: Full=30S ribosomal protein S18
>> gi|122939895|sp|Q02VU1.1|RS18_LACLS RecName: Full=30S ribosomal protein
>> S18
>> gi|166220956|sp|A2RNZ2.1|RS18_LACLM RecName: Full=30S ribosomal protein
>> S18
>> gi|12725253|gb|AAK06287.1|AE006448_5 30S ribosomal protein S18
>> [Lactococcus
>> lactis subsp. lactis Il1403] gi|116108791|gb|ABJ73931.1| SSU ribosomal
>> protein S18P [Lactococcus lactis subsp. cremoris SK11]
>> gi|124494037|emb|CAL99037.1| 30S ribosomal protein S18 [Lactococcus lactis
>> subsp. cremoris MG1363] gi|281376497|gb|ADA65983.1| SSU ribosomal protein
>> S18P [Lactococcus lactis subsp. lactis KF147] gi|300072039|gb|ADJ61439.1|
>> 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000]
>>
>> MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
>> N
>>
>> >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
>> [Dictyostelium discoideum AX4] gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
>> Full=Calfumirin-1; Short=CAF-1 gi|793761|dbj|BAA06266.1| calfumirin-1
>> [Dictyostelium discoideum] gi|60470106|gb|EAL68086.1| hypothetical protein
>> DDB_G0277827 [Dictyostelium discoideum AX4]
>>
>> MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
>>
>> KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
>> VQKLLNPDQ
>>
>> >gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911
>> [Dictyostelium discoideum AX4] gi|60470987|gb|EAL68957.1| hypothetical
>> protein DDB_G0276911 [Dictyostelium discoideum AX4]
>>
>> MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
>>
>> DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
>>
>> -----------
>> My problem is that I need to filter this file to extract the proteins of
>> interest, based on some keywords applied to the header line. As a
>> preliminary step, I wrote the following code to calculate the total number
>> of lines in the file:
>>
>> f = open ('nr')
>> count = 0
>> for i in f.readlines():
>>    line = f.next().strip()
>>    count = count + 1
>> f.close()
>> print count
>>
>> On running this program, i get the following error:
>>
>> Traceback (most recent call last):
>>  File "C:\Users\K\Downloads\nr\nr.py", line 34, in <module>
>>    for i in f.readlines():
>> MemoryError
>>
>> A slightly modified version of the above program works fine for the first
>> 10 or 100 or 1000 lines of the file nr:
>>
>>
>> ----
>>
>> Any suggestions on how I can work around this MemoryError problem?
>>



-- 
Whatever you can do or imagine, begin it;
boldness has beauty, magic, and power in it.

-- Goethe