python reading file memory cost
Chris Rebert
clp2 at rebertia.com
Tue Aug 2 02:55:15 EDT 2011
On Mon, Aug 1, 2011 at 8:22 PM, Tony Zhang <warriorlance at gmail.com> wrote:
> Thanks!
>
> Actually, I used .readline() to parse the file line by line, because I need
> to find the start position at which to begin extracting data into a list, and
> the end point at which to pause extracting, then repeat until the end of file.
> My file to read is formatted like this:
>
> blabla...useless....
> useless...
>
> /sign/
> data block(e.g. 10 cols x 1000 rows)
> ...
> blank line
> /sign/
> data block(e.g. 10 cols x 1000 rows)
> ...
> blank line
> ...
> ...
> EOF
> Let's call this file 'myfile'. Here is my Python snippet:
>
> f = open('myfile', 'r')
> blocknum = 0  # number the data blocks
> data = []
> while True:
>     # find the beginning of the next data block
>     while not f.readline().startswith('/a1/'): pass
>     # create a new sublist to hold this data block
>     data.append([])
>     blocknum += 1
>     line = f.readline()
>     # a blank line marks the end of one block
>     while line.strip():
>         data[blocknum-1].append(["%2.6E" % float(x) for x in line.split()])
>         line = f.readline()
>     print "Read Block %d" % blocknum
>     if not f.readline(): break
>
> The result was that reading a 500 MB file consumed almost 2 GB of RAM.
> I cannot figure it out; somebody help!
If you could store the floats themselves, rather than their string
representations, that would be more space-efficient. You could then
also use the `array` module, which is more space-efficient than lists
(http://docs.python.org/library/array.html ). Numpy would also be
worth investigating since multidimensional arrays are involved.
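For instance, here is a rough, untested sketch of the block-reading loop that
stores each block as a flat array of C doubles instead of a list of formatted
strings (it assumes the same '/a1/' marker and blank-line delimiters from your
snippet; the name read_blocks is just for illustration):

    from array import array

    def read_blocks(path, marker='/a1/'):
        blocks = []
        with open(path, 'r') as f:
            for line in f:
                if not line.startswith(marker):
                    continue
                # one flat array of C doubles per block; far cheaper than a
                # list of per-row lists of formatted strings
                block = array('d')
                for line in f:
                    if not line.strip():
                        break
                    block.extend(float(x) for x in line.split())
                blocks.append(block)
        return blocks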
The next obvious question would then be: do you /really/ need /all/ of
the data in memory at once?
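If not, a generator that yields one block at a time would keep only the
current block in memory. A minimal sketch, again assuming your markers (the
name iter_blocks is illustrative):

    def iter_blocks(path, marker='/a1/'):
        # yield one block at a time so only the current block lives in memory
        with open(path, 'r') as f:
            for line in f:
                if line.startswith(marker):
                    block = []
                    for line in f:
                        if not line.strip():
                            break
                        block.append([float(x) for x in line.split()])
                    yield block

    # usage: process each ~10x1000 block, then let it be garbage-collected
    # for block in iter_blocks('myfile'):
    #     do_something_with(block)   # do_something_with is hypothetical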
Also, just so you're aware:
http://docs.python.org/library/sys.html#sys.getsizeof
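For example, comparing a float object with its formatted string (exact sizes
vary by platform and Python version):

    import sys
    x = 1.2345678
    s = "%2.6E" % x
    # the float object is typically much smaller than its formatted
    # string on a 64-bit CPython build
    print sys.getsizeof(x), sys.getsizeof(s)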
Cheers,
Chris
--
http://rebertia.com