[Baypiggies] reading files quickly and efficiently

Tung Wai Yip tungwaiyip at yahoo.com
Wed Nov 17 21:54:42 CET 2010


readlines() reads the entire file into memory, which is what triggers the
MemoryError on a 6.5 GB file. Use f directly as an iterator instead; it
yields one line at a time.

# not tested!
count = 0
with open('nr') as f:    # the file is closed even if an error occurs
    for line in f:       # reads one line at a time, not the whole file
        count += 1
print(count)
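
The same line-by-line iteration extends to the filtering task described in
the quoted message below. What follows is only a rough sketch, not a tested
solution for the real nr file. It assumes each record starts with a single
header line beginning with '>' and is followed by one or more sequence lines.
The name filter_fasta, the case-insensitive substring match, and the sample
lines are all made up for illustration.

```python
def filter_fasta(lines, keywords):
    """Yield (header, sequence) for records whose header contains a keyword.

    `lines` can be any iterable of text lines, including an open file
    object, so only the current record is held in memory.
    """
    keywords = [k.lower() for k in keywords]
    header = None
    seq = []
    keep = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header ends the previous record; emit it if it matched.
            if keep:
                yield header, "".join(seq)
            header = line[1:]
            seq = []
            keep = any(k in header.lower() for k in keywords)
        elif keep and line:
            # Collect sequence lines only for records we intend to keep.
            seq.append(line)
    # Emit the final record, which has no following header to trigger it.
    if keep:
        yield header, "".join(seq)


# Hypothetical in-memory sample; with the real data, pass the open file.
sample = [
    ">gi|1| 30S ribosomal protein S18 [Lactococcus lactis]",
    "MAQQRRGG",
    "N",
    ">gi|2| hypothetical protein [Dictyostelium discoideum AX4]",
    "MASTQNIV",
]
for header, sequence in filter_fasta(sample, ["ribosomal"]):
    print(header)
    print(sequence)
```

With the real file, something like "with open('nr') as f:" followed by a loop
over filter_fasta(f, ['ribosomal']) should work the same way, since the file
object is consumed one line at a time.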

Wai Yip


> I need to work on a file whose size is around 6.5 GB. This file consists
> of protein header information followed by the corresponding protein
> sequence. Here are a few sample lines from this file:
>
> -----------
>> gi|15674171|ref|NP_268346.1| 30S ribosomal protein S18 [Lactococcus  
>> lactis
> subsp. lactis Il1403] gi|116513137|ref|YP_812044.1| 30S ribosomal protein
> S18 [Lactococcus lactis subsp. cremoris SK11]
> gi|125625229|ref|YP_001033712.1| 30S ribosomal protein S18 [Lactococcus
> lactis subsp. cremoris MG1363] gi|281492845|ref|YP_003354825.1| 50S
> ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147]
> gi|13878750|sp|Q9CDN0.1|RS18_LACLA RecName: Full=30S ribosomal protein  
> S18
> gi|122939895|sp|Q02VU1.1|RS18_LACLS RecName: Full=30S ribosomal protein  
> S18
> gi|166220956|sp|A2RNZ2.1|RS18_LACLM RecName: Full=30S ribosomal protein  
> S18
> gi|12725253|gb|AAK06287.1|AE006448_5 30S ribosomal protein S18  
> [Lactococcus
> lactis subsp. lactis Il1403] gi|116108791|gb|ABJ73931.1| SSU ribosomal
> protein S18P [Lactococcus lactis subsp. cremoris SK11]
> gi|124494037|emb|CAL99037.1| 30S ribosomal protein S18 [Lactococcus  
> lactis
> subsp. cremoris MG1363] gi|281376497|gb|ADA65983.1| SSU ribosomal protein
> S18P [Lactococcus lactis subsp. lactis KF147] gi|300072039|gb|ADJ61439.1|
> 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000]
> MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
> N
>> gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
> [Dictyostelium discoideum AX4] gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
> Full=Calfumirin-1; Short=CAF-1 gi|793761|dbj|BAA06266.1| calfumirin-1
> [Dictyostelium discoideum] gi|60470106|gb|EAL68086.1| hypothetical  
> protein
> DDB_G0277827 [Dictyostelium discoideum AX4]
> MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
> KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
> VQKLLNPDQ
>> gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911
> [Dictyostelium discoideum AX4] gi|60470987|gb|EAL68957.1| hypothetical
> protein DDB_G0276911 [Dictyostelium discoideum AX4]
> MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
> DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
>
> -----------
> My problem is that I need to filter this file to extract the proteins of
> interest, based on keywords applied to the header line. As a preliminary
> step, I wrote the following code to count the total number of lines in
> the file:
>
> f = open ('nr')
> count = 0
> for i in f.readlines():
>     line = f.next().strip()
>     count = count + 1
> f.close()
> print count
>
> On running this program, I get the following error:
>
> Traceback (most recent call last):
>   File "C:\Users\K\Downloads\nr\nr.py", line 34, in <module>
>     for i in f.readlines():
> MemoryError
>
> A slightly modified version of the above program works fine for the
> first 10 or 100 or 1000 lines of the file nr:
>
>
> ----
>
> Any suggestions on how I can work around this MemoryError problem?

