[Baypiggies] reading files quickly and efficiently

Zachary Collins recursive.cookie.jar at gmail.com
Wed Nov 17 21:44:32 CET 2010


Yes.  The readlines() function will load the whole file into memory
before splitting it into lines.
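
For what it's worth, simply iterating over the file object avoids that,
since Python only pulls in one line at a time.  A rough, untested sketch:

# Count lines lazily: the file object yields one line per iteration,
# so only the current line needs to be held in memory.
count = 0
f = open('nr')
for line in f:
    count += 1
f.close()
print(count)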

How about just using read() with a small buffer size and incrementally
counting newlines that way?
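
Something along these lines, for example (untested; the 64 KB buffer
size is an arbitrary choice):

# Count newlines by reading fixed-size chunks in binary mode, so memory
# use stays at roughly one buffer regardless of the file size.
# (A final line without a trailing newline would not be counted.)
count = 0
f = open('nr', 'rb')
while True:
    chunk = f.read(64 * 1024)   # read up to 64 KB at a time
    if not chunk:
        break
    count += chunk.count(b'\n')
f.close()
print(count)

The same incremental approach applies later when filtering records by
keywords in the header lines: process the file as a stream instead of
reading it all at once.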

2010/11/17 Vikram K <kpguy1975 at gmail.com>:
> I need to work on a file whose size is around 6.5 GB.  This file consists of
> protein header information followed by the corresponding protein sequence.
> Here are a few sample lines of this file:
>
> -----------
>>gi|15674171|ref|NP_268346.1| 30S ribosomal protein S18 [Lactococcus lactis
>> subsp. lactis Il1403] gi|116513137|ref|YP_812044.1| 30S ribosomal protein
>> S18 [Lactococcus lactis subsp. cremoris SK11]
>> gi|125625229|ref|YP_001033712.1| 30S ribosomal protein S18 [Lactococcus
>> lactis subsp. cremoris MG1363] gi|281492845|ref|YP_003354825.1| 50S
>> ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147]
>> gi|13878750|sp|Q9CDN0.1|RS18_LACLA RecName: Full=30S ribosomal protein S18
>> gi|122939895|sp|Q02VU1.1|RS18_LACLS RecName: Full=30S ribosomal protein S18
>> gi|166220956|sp|A2RNZ2.1|RS18_LACLM RecName: Full=30S ribosomal protein S18
>> gi|12725253|gb|AAK06287.1|AE006448_5 30S ribosomal protein S18 [Lactococcus
>> lactis subsp. lactis Il1403] gi|116108791|gb|ABJ73931.1| SSU ribosomal
>> protein S18P [Lactococcus lactis subsp. cremoris SK11]
>> gi|124494037|emb|CAL99037.1| 30S ribosomal protein S18 [Lactococcus lactis
>> subsp. cremoris MG1363] gi|281376497|gb|ADA65983.1| SSU ribosomal protein
>> S18P [Lactococcus lactis subsp. lactis KF147] gi|300072039|gb|ADJ61439.1|
>> 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000]
> MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
> N
>>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
>> [Dictyostelium discoideum AX4] gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
>> Full=Calfumirin-1; Short=CAF-1 gi|793761|dbj|BAA06266.1| calfumirin-1
>> [Dictyostelium discoideum] gi|60470106|gb|EAL68086.1| hypothetical protein
>> DDB_G0277827 [Dictyostelium discoideum AX4]
> MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
> KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
> VQKLLNPDQ
>>gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911
>> [Dictyostelium discoideum AX4] gi|60470987|gb|EAL68957.1| hypothetical
>> protein DDB_G0276911 [Dictyostelium discoideum AX4]
> MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
> DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
>
> -----------
> My problem is that I need to filter this file so as to extract the
> proteins that are of interest to me, based on keywords applied to the
> header line. As a preliminary step, I wrote the following code to
> calculate the total number of lines in the file:
>
> f = open ('nr')
> count = 0
> for i in f.readlines():
>     line = f.next().strip()
>     count = count + 1
> f.close()
> print count
>
> On running this program, I get the following error:
>
> Traceback (most recent call last):
>   File "C:\Users\K\Downloads\nr\nr.py", line 34, in <module>
>     for i in f.readlines():
> MemoryError
>
> A slightly modified version of the above program works fine for the first 10
> or 100 or 1000 lines of the file nr:
>
>
> ----
>
> Any suggestions on how I can work around this MemoryError problem?
>

