Memory error due to big input file

Mon Jul 13 17:01:17 EDT 2009

sityee kong wrote:
> Hi All,
>
> I have a similar problem that many new python users might encounter. I would
> really appreciate if you could help me fix the error.
> I have a big text file with size more than 2GB. It turned out memory error
> when reading in this file. Here is my python script, the error occurred at
> line -- self.fh.readlines().
>
> import math
> import time
>
> class textfile:
>   def __init__(self,fname):
>      self.name=fname
>      self.fh=open(fname)
>      self.fh.readline()
>      self.lines=self.fh.readlines()
>
> a=textfile("/home/sservice/nfbc/GenoData/CompareCalls3.diff")
>
> lfile=len(a.lines)
>
> def myfun(snp,start,end):
>   subdata=a.lines[start:end+1]
>   NEWmiss=0
>   OLDmiss=0
>   DIFF=0
>   for row in subdata:
>      k=row.split()
>      if (k[3]=="0/0") & (k[4]!="0/0"):
>         NEWmiss=NEWmiss+1
>      elif (k[3]!="0/0") & (k[4]=="0/0"):
>         OLDmiss=OLDmiss+1
>      elif (k[3]!="0/0") & (k[4]!="0/0"):
>         DIFF=DIFF+1
>   result.write(snp+" "+str(NEWmiss)+" "+str(OLDmiss)+" "+str(DIFF)+"\n")
>
> result=open("Summary_noLoop_diff3.txt","w")
> result.write("SNP NEWmiss OLDmiss DIFF\n")
>
> start=0
> snp=0
> for i in range(lfile):
>   if (i==0): continue
>   after=a.lines[i].split()
>   before=a.lines[i-1].split()
>   if (before[0]==after[0]):
>     if (i!=(lfile-1)): continue
>     else:
>       end=lfile-1
>       myfun(before[0],start,end)
>       snp=snp+1
>   else:
>     end=i-1
>     myfun(before[0],start,end)
>     snp=snp+1
>     start=i
>     if (i ==(lfile-1)):
>       myfun(after[0],start,start)
>       snp=snp+1
>
> result.close()
>
>   sincerely, phoebe
>
>   
Others have pointed out that you have too little memory for a 2 GB data
structure.  If you're running on a 32-bit system, chances are it won't
matter how much memory you add: a process is limited to 4 GB of address
space, the OS typically reserves about half of that, and your code and
other data take some more, so you don't have 2 GB left.  A 64-bit
version of Python, running on a 64-bit OS, might be able to "just work."

Anyway, loading the whole file into a list is seldom the best answer, 
except for files under a meg or so.  It's usually better to process the 
file in sequence.  It looks like you're also making slices of that data, 
so they could potentially be pretty big as well.
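
As a rough sketch of what processing the file in sequence could look
like for your script: this assumes that rows sharing the same first
field are always adjacent, which your index loop already relies on.
The names summarize and count_group are made up for illustration, and
itertools.groupby is my substitute for the start/end bookkeeping, not
something from your code.

```python
from itertools import groupby


def count_group(rows):
    """Count the three cases for one run of already-split rows."""
    new_miss = old_miss = diff = 0
    for k in rows:
        if k[3] == "0/0" and k[4] != "0/0":
            new_miss += 1
        elif k[3] != "0/0" and k[4] == "0/0":
            old_miss += 1
        elif k[3] != "0/0" and k[4] != "0/0":
            diff += 1
    return new_miss, old_miss, diff


def summarize(infile, outfile):
    """Stream infile once; write one line per run of equal first fields."""
    infile.readline()  # skip the first line, as the original __init__ does
    outfile.write("SNP NEWmiss OLDmiss DIFF\n")
    rows = (line.split() for line in infile)
    rows = (r for r in rows if r)  # assumption: blank lines can be ignored
    for snp, group in groupby(rows, key=lambda r: r[0]):
        outfile.write("%s %d %d %d\n" % ((snp,) + count_group(group)))
```

You would call it with two open file objects, something like
summarize(open(inname), open(outname, "w")), and close both afterwards.
Only one line is held in memory at a time, so the size of the input
stops mattering.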

If you can be assured that you only need the current line and the 
previous two (for example), then you can use a list of just those three, 
and delete the oldest one, and add a new one to that list each time 
through the loop.
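
A minimal sketch of that three-line window.  I've used
collections.deque with maxlen instead of a plain list, because it
drops the oldest item on its own when a new one is appended; the
function name sliding_windows is just something I picked for the
example.

```python
from collections import deque


def sliding_windows(lines, size=3):
    """Yield tuples of the most recent `size` lines, oldest first."""
    window = deque(maxlen=size)  # appending to a full deque drops the oldest
    for line in lines:
        window.append(line)
        if len(window) == size:
            yield tuple(window)
```

Since a file object is itself an iterable of lines, you can pass the
open file straight in: for prev2, prev1, cur in sliding_windows(fh).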

Or, you can add some methods to that 'textfile' class that fetch a line
by index.  The brute-force way is to pre-scan the file once and record
the file offset of each line, rather than storing the line itself.  You
still have a list with one entry per line, but the entries are integers
instead of strings.  Then when somebody calls your method, they pass an
integer index, and you seek to that offset and return the particular
line.  A little caching for performance, and you're good to go.
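
Here is one way the offset idea might look.  Treat it as a sketch
under a few assumptions: the class name IndexedTextFile is invented,
the file is opened in binary mode so the offsets are plain byte
positions, the text is assumed to be ASCII, and the "cache" is nothing
more than the last line fetched.  Unlike your __init__, it does not
skip the first line, so index 0 is the header.  Slices are not
supported.

```python
class IndexedTextFile(object):
    """Fetch lines by index while keeping only their offsets in memory."""

    def __init__(self, fname):
        self.fh = open(fname, "rb")  # binary mode: tell()/seek() use bytes
        self.offsets = []
        # Pre-scan: remember where each line starts, not the line itself.
        offset = self.fh.tell()
        line = self.fh.readline()
        while line:
            self.offsets.append(offset)
            offset = self.fh.tell()
            line = self.fh.readline()
        self._cache = (None, None)  # (index, text) of the last fetch

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        if self._cache[0] != index:
            self.fh.seek(self.offsets[index])
            text = self.fh.readline().decode("ascii")  # assumed encoding
            self._cache = (index, text)
        return self._cache[1]

    def close(self):
        self.fh.close()
```

With __len__ and __getitem__ in place, code that did len(a.lines) and
a.lines[i] can do len(a) and a[i] on an instance instead.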

Anyway, if you organize it that way, you can code the rest of the module 
to not care whether the whole file is really in memory or not.

BTW, you should derive all your classes from something.  If nothing
else, use object, which (in Python 2) makes it a new-style class:
  class textfile(object):