Memory issues when storing as List of Strings vs List of List
Ben Finney
ben+python at benfinney.id.au
Tue Nov 30 15:19:32 EST 2010
OW Ghim Siong <owgs at bii.a-star.edu.sg> writes:
> I have a big file 1.5GB in size, with about 6 million lines of
> tab-delimited data. I have to perform some filtration on the data and
> keep the good data. After filtration, I have about 5.5 million lines
> remaining. As you might have already guessed, I have to read them in
> batches, and I did so using .readlines(100000000).
Why do you need to handle the batching in your code? Perhaps you're not
aware that a file object is already an iterator for the lines of text in
the file.
> After reading each batch, I will split the line (in string format) to
> a list using .split("\t") and then check several conditions, after
> which if all conditions are satisfied, I will store the list into a
> matrix.
As I understand it, you don't need a line after moving to the next. So
there's no need to maintain a manual buffer of lines at all; please
explain if there is something additional requiring a huge buffer of
input lines.
> The code is as follows:
> -----Start------
> a=open("bigfile")
> matrix=[]
> while True:
>     lines = a.readlines(100000000)
>     for line in lines:
>         data=line.split("\t")
>         if several_conditions_are_satisfied:
>             matrix.append(data)
>     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
>     if len(lines)==0:
>         break
> -----End-----
Using the file's native line iterator::
    infile = open("bigfile")
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            matrix.append(record)
> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
>
> Traceback (most recent call last):
> MemoryError
If you still get a MemoryError, you can use the ‘pdb’ module
<URL:http://docs.python.org/library/pdb.html> to debug it interactively.
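For example, one way to do that (a sketch, assuming a hypothetical ‘main’
function that wraps the loop above) is post-mortem debugging::

    import pdb
    import sys

    try:
        main()    # hypothetical entry point running the filtering loop
    except MemoryError:
        # Drop into the interactive debugger at the point of failure.
        pdb.post_mortem(sys.exc_info()[2])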
Another option is to catch the MemoryError and construct a diagnostic
message similar to the one you had above::
    import sys

    infile = open("bigfile")
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            try:
                matrix.append(record)
            except MemoryError:
                matrix_len = len(matrix)
                sys.stderr.write(
                    "len(matrix): %(matrix_len)d\n" % vars())
                raise
> I have tried creating such a matrix of equivalent size and it only
> uses 35 MB of memory, but I am not sure why, when using the code above,
> the memory usage shot up so fast and exceeded 2 GB.
>
> Any advice is greatly appreciated.
With large data sets, and the manipulation and computation you will
likely be wanting to perform, it's probably time to consider the NumPy
library <URL:http://numpy.scipy.org/> which has much more powerful array
types, part of the SciPy library <URL:http://www.scipy.org/>.
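For instance, here is a minimal sketch of that approach, assuming the kept
fields are numeric and using a hypothetical ‘good_row’ function to stand in
for your filtering conditions (neither is from your post)::

    import numpy

    def filtered_rows(path):
        # Parse one line at a time, keeping only rows that pass the filter.
        for line in open(path):
            record = line.rstrip("\n").split("\t")
            if good_row(record):
                yield [float(field) for field in record]

    # A 2-D float array is far more compact than a list of lists of strings.
    matrix = numpy.array(list(filtered_rows("bigfile")))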
--
\ “[It's] best to confuse only one issue at a time.” —Brian W. |
`\ Kernighan, Dennis M. Ritchie, _The C programming language_, 1988 |
_o__) |
Ben Finney