[Numpy-discussion] Loading a > GB file into array

Martin Spacek numpy at mspacek.mm.st
Fri Nov 30 03:47:41 EST 2007


I need to load a 1.3GB binary file entirely into a single numpy.uint8
array. I've been using numpy.fromfile(), but for files larger than 1.2GB
on my win32 machine, I get a memory error. Actually, since I have several
other Python modules imported at the same time, including pygame, I get
a "pygame parachute" and a segfault that dumps me out of Python:

data = numpy.fromfile(f, numpy.uint8) # where f is the open file

1382400000 items requested but only 0 read
Fatal Python error: (pygame parachute) Segmentation Fault

If I stick to just doing it at the interpreter with only numpy imported,
I can open files that are roughly 100MB bigger, but any more than that
and I get a clean MemoryError. This machine has 2GB of RAM. I've tried
setting the /3GB switch at Windows XP bootup, as well as all the
registry suggestions at
http://www.msfn.org/board/storage-process-command-t62001.html. No luck.
I get the same error in 32-bit Ubuntu for a sufficiently big file.

I find that if I load the file in two pieces into two arrays, say 1GB
and 0.3GB respectively, I can avoid the memory error. So it seems that
it's not that Windows can't allocate the memory, just that it can't
find a large enough contiguous block. I'm OK with this, but for indexing
convenience, I'd like to be able to treat the two arrays as if they were
one. Specifically, this file is movie data, and the array I'd like to
get out of this is of shape (nframes, height, width). Right now I'm
getting two arrays that are something like (0.8*nframes, height, width)
and (0.2*nframes, height, width). Later in the code, I only need to
index over the 0th dimension, i.e. the frame index.
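
For concreteness, the piecewise load might look roughly like the sketch below. The function name load_in_chunks and its parameters are made up for illustration and are not from my actual code. It assumes a headerless file of uint8 pixels and uses the count argument of numpy.fromfile() to read one piece at a time, so no single allocation has to hold the whole file.

```python
import os
import numpy

def load_in_chunks(fname, height, width, frames_per_chunk):
    """Read a raw uint8 movie file into a list of smaller arrays.

    Hypothetical helper: each returned array has shape
    (n, height, width) with n <= frames_per_chunk.
    """
    framesize = height * width  # bytes per frame, one byte per pixel
    # Whole frames in the file; a trailing partial frame is ignored
    nframes = os.path.getsize(fname) // framesize
    chunks = []
    f = open(fname, 'rb')
    try:
        for start in range(0, nframes, frames_per_chunk):
            n = min(frames_per_chunk, nframes - start)
            # count limits this call to n frames' worth of bytes
            data = numpy.fromfile(f, dtype=numpy.uint8, count=n * framesize)
            chunks.append(data.reshape(n, height, width))
    finally:
        f.close()
    return chunks
```

With frames_per_chunk set to about 0.8*nframes this gives the two arrays described above; a smaller value gives more, smaller pieces.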

I'd like to access all the data using a single range of frame indices.
Is there any way to combine these two arrays into what looks like a
single array, without having to do any copying within memory? I've tried
using numpy.concatenate(), but that gives me a MemoryError because, I
presume, it's doing a copy. Would it be better to load the file one
frame at a time, generating nframes arrays of shape (height, width), and
sticking them consecutively in a python list?
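
To make the question concrete, here is a minimal sketch of the kind of wrapper I have in mind. ChunkedFrames is a hypothetical class, not something numpy provides. It only supports integer frame indices (no slicing), which is all I need later in the code, and it hands back views into the existing arrays rather than copying.

```python
import bisect
import numpy

class ChunkedFrames(object):
    """Present several (n_i, height, width) arrays as one frame sequence.

    Hypothetical helper: integer frame indexing only, no pixel data copied.
    """
    def __init__(self, chunks):
        self.chunks = list(chunks)
        # starts[i] is the global index of the first frame in chunk i
        self.starts = []
        total = 0
        for c in self.chunks:
            self.starts.append(total)
            total += len(c)
        self.nframes = total

    def __len__(self):
        return self.nframes

    def __getitem__(self, i):
        if i < 0:
            i += self.nframes  # allow negative indices like a list
        if not 0 <= i < self.nframes:
            raise IndexError('frame index out of range')
        # Find the last chunk whose first frame index is <= i
        ci = bisect.bisect_right(self.starts, i) - 1
        # Indexing the 0th axis of an ndarray returns a view
        return self.chunks[ci][i - self.starts[ci]]
```

The same class would also work on a list of nframes single-frame arrays of shape (1, height, width), so it covers the frame-at-a-time idea too, at the cost of one small array object per frame.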

I'm using numpy 1.0.4 (compiled from the source tarball with Intel's MKL)
on Python 2.5.1 on Windows XP.

Thanks for any advice,

Martin



