I need to work on a file that is about 6.5 GB in size. The file consists of protein header lines, each followed by the corresponding protein sequence. Here are a few sample lines from the file:<br><br>-----------<br>>gi|15674171|ref|NP_268346.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis Il1403] gi|116513137|ref|YP_812044.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris SK11] gi|125625229|ref|YP_001033712.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris MG1363] gi|281492845|ref|YP_003354825.1| 50S ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147] gi|13878750|sp|Q9CDN0.1|RS18_LACLA RecName: Full=30S ribosomal protein S18 gi|122939895|sp|Q02VU1.1|RS18_LACLS RecName: Full=30S ribosomal protein S18 gi|166220956|sp|A2RNZ2.1|RS18_LACLM RecName: Full=30S ribosomal protein S18 gi|12725253|gb|AAK06287.1|AE006448_5 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis Il1403] gi|116108791|gb|ABJ73931.1| SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris SK11] gi|124494037|emb|CAL99037.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris MG1363] gi|281376497|gb|ADA65983.1| SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147] gi|300072039|gb|ADJ61439.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000]<br>
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ<br>N<br>>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]<br>
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY<br>KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK<br>VQKLLNPDQ<br>>gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4] gi|60470987|gb|EAL68957.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]<br>
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE<br>DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR<br><br>-----------<br>My problem is that I need to filter this file to extract the proteins I am interested in, based on keywords matched against the header line. As a preliminary step, I wrote the following code to count the total number of lines in the file:<br>
<br>f = open ('nr')<br>count = 0<br>for i in f.readlines():<br> line = f.next().strip()<br> count = count + 1<br>f.close()<br>print count<br><br>When I run this program, I get the following error:<br><br>Traceback (most recent call last):<br>
File "C:\Users\K\Downloads\nr\nr.py", line 34, in <module><br> for i in f.readlines():<br>MemoryError<br><br>A slightly modified version of the above program works fine for the first 10 or 100 or 1000 lines of the file nr:<br>
<br><br>----<br><br>Any suggestions on how I can work around this MemoryError?<br>
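For reference, the MemoryError most likely comes from `f.readlines()`, which builds a list of every line in the 6.5 GB file before the loop starts. A minimal sketch of a streaming alternative is shown here, assuming the file is plain text with header lines starting with `>`. The function names `count_lines` and `filter_fasta`, the output path, and the case-insensitive substring match are illustrative assumptions, not something from the original program.

```python
def count_lines(path):
    """Count lines by iterating over the file object, one line at a time.

    Iterating over the handle reads lazily, so only the current line
    (plus a small buffer) is held in memory.
    """
    count = 0
    with open(path) as handle:
        for _ in handle:
            count += 1
    return count


def filter_fasta(in_path, out_path, keywords):
    """Copy records whose header contains any keyword (hypothetical helper).

    A record is a header line starting with '>' plus the sequence lines
    that follow it. Matching is a case-insensitive substring test, which
    is an assumption about what "keywords" should mean here.
    Returns the number of records written.
    """
    keywords = [k.lower() for k in keywords]
    kept = 0
    keep = False  # whether the current record is being copied
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line.lower()
                keep = any(k in header for k in keywords)
                if keep:
                    kept += 1
            if keep:
                dst.write(line)
    return kept
```

With the file from the question this could be called as `count_lines('nr')` and `filter_fasta('nr', 'nr_filtered.fasta', ['ribosomal'])`, where the output file name and the keyword are placeholders. Both functions make a single pass over the file, so memory use should stay roughly constant regardless of file size.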