Striding through arbitrarily large files

I was struggling with methods of reading large disk files into numpy efficiently (not FITS or .npy, just raw files of IEEE floats from numpy.tostring()). When loading arbitrarily large files it would be nice not to read more than the plot can display before zooming in. There apparently are no built-in methods that allow skipping/striding... With a 2 GB file, I want to read n (like 4096) evenly sampled points out of it.

I tried making a dtype, and other tricks, to read "Pythonically", but failed. I broke down and used a for loop with fh.seek() and fromfile(). The file will be open()ed once, but data read many times.

    import numpy

    num_channels = 9
    desired_len = 4096
    bytes_per_val = numpy.dtype(numpy.float32).itemsize

    f_obj = open(path, 'rb')
    f_obj.seek(0, 2)                    # seek to end to get the file length
    file_length = f_obj.tell()
    f_obj.seek(0, 0)

    bytes_per_smp = num_channels * bytes_per_val
    num_samples = file_length // bytes_per_smp
    stride_smps = num_samples // desired_len    # an int
    stride_bytes = stride_smps * bytes_per_smp  # distance between consecutive absolute seeks

    arr = numpy.zeros((desired_len, num_channels))
    for i in range(desired_len):
        f_obj.seek(i * stride_bytes, 0)
        arr[i] = numpy.fromfile(f_obj, dtype=numpy.float32,
                                count=num_channels)

So, is there a better way to move the pointer through the file without a for loop? A generator? The dtype and other methods like mmap fail with MemoryError since they still try to load the whole file, although apparently you can mmap on 64-bit systems, which I might try soon with a new 64-bit install.

- Ray

On 04/02/2014 16:01, RayS wrote:
I was struggling with methods of reading large disk files into numpy efficiently (not FITS or .npy, just raw files of IEEE floats from numpy.tostring()). When loading arbitrarily large files it would be nice to not bother reading more than the plot can display before zooming in. There apparently are no built in methods that allow skipping/striding...
If you mmap the data file with np.memmap() you can access the data in a strided way through the numpy array interface, and the OS will handle the scheduling of the reads from the disc. Note, however, that if the data samples you need are quite dense, there is no real advantage in doing this, because the OS will have to read a whole page for each access anyway.

Cheers,
Daniele
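A minimal sketch of the strided np.memmap read Daniele describes; the file name and channel count are assumptions carried over from Ray's example:

    import numpy as np

    num_channels = 9                    # assumed, as in Ray's example
    mm = np.memmap('huge_file.dat',     # hypothetical raw float32 file
                   dtype=np.float32, mode='r')

    n_samples = mm.size // num_channels
    data = mm[:n_samples * num_channels].reshape(n_samples, num_channels)

    # A strided view touches only ~desired_len pages; the OS pages them in
    # on demand instead of reading the whole 2 GB file.
    desired_len = 4096
    stride = max(n_samples // desired_len, 1)
    preview = np.array(data[::stride][:desired_len])  # copy the picks into RAM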

At 07:09 AM 2/4/2014, you wrote:
If you mmap the data file with np.memmap() you can access the data in a strided way through the numpy array interface, and the OS will handle the scheduling of the reads from the disc.
Note, however, that if the data samples you need are quite dense, there is no real advantage in doing this, because the OS will have to read a whole page for each access anyway.
Thanks Daniele, I'll be trying mmap with 64-bit Python. With 32-bit, the mmap method throws MemoryError with 2.5 GB files... The idea is that we let the users inspect the huge files graphically, then they can "zoom" into regions of interest and load ~100 MB en bloc for the usual spectral analysis.

- Ray
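One way to implement the "zoom, then load a block en bloc" step Ray describes; read_block, its defaults, and the example numbers are hypothetical, not from the thread:

    import numpy as np

    def read_block(path, start_sample, n_samples, num_channels=9):
        """Read one contiguous run of float32 samples from a raw file."""
        bytes_per_smp = num_channels * np.dtype(np.float32).itemsize
        with open(path, 'rb') as f:
            f.seek(start_sample * bytes_per_smp)   # jump straight to the region
            block = np.fromfile(f, dtype=np.float32,
                                count=n_samples * num_channels)
        return block.reshape(-1, num_channels)

    # ~100 MB region: 2,900,000 samples * 9 channels * 4 bytes
    roi = read_block('huge_file.dat', start_sample=1000000, n_samples=2900000)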

On Tue, Feb 4, 2014 at 4:27 PM, RayS <rays@blue-cove.com> wrote:
Thanks Daniele, I'll be trying mmap with 64-bit Python. With 32-bit, the mmap method throws MemoryError with 2.5 GB files... The idea is that we let the users inspect the huge files graphically, then they can "zoom" into regions of interest and load ~100 MB en bloc for the usual spectral analysis.
Memory maps are limited to the size of the available address space (31 bits with sign), so you would have to slide them; see e.g. the smmap module. But it's not likely this is going to be much faster than a loop with explicit seeks, depending on the sparseness of the data: memory maps have relatively high overheads at the operating-system level.
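A sketch of the sliding-window idea under a small address space, using np.memmap's offset argument to map only a window at a time; the function name and window size are illustrative:

    import numpy as np

    def map_window(path, start_sample, window_samples, num_channels=9):
        """Map only a window of a huge raw float32 file, not the whole file."""
        itemsize = np.dtype(np.float32).itemsize
        offset = start_sample * num_channels * itemsize
        return np.memmap(path, dtype=np.float32, mode='r',
                         offset=offset,
                         shape=(window_samples, num_channels))

    # Slide a ~36 MB window instead of mapping 2.5 GB at once.
    win = map_window('huge_file.dat', start_sample=0, window_samples=1000000)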

RayS <rays@blue-cove.com> wrote:
Thanks Daniele, I'll be trying mmap with 64-bit Python. With 32-bit, the mmap method throws MemoryError with 2.5 GB files... The idea is that we let the users inspect the huge files graphically, then they can "zoom" into regions of interest and load ~100 MB en bloc for the usual spectral analysis.
Transfer the file to PyTables, and you can zoom as you like. Or use a recent Python, so mmap actually has an offset argument.

Sturla
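A sketch of the PyTables route Sturla suggests: convert the raw file once into a chunked HDF5 EArray, after which strided "zoom" reads touch only the needed chunks on disk. File names and chunk size are assumptions for illustration:

    import numpy as np
    import tables

    num_channels = 9
    chunk_rows = 1 << 16  # rows converted per iteration; tune to taste

    # One-time conversion: raw float32 stream -> chunked, compressed EArray.
    with open('huge_file.dat', 'rb') as f, \
         tables.open_file('huge_file.h5', 'w') as h5:
        arr = h5.create_earray(h5.root, 'data', tables.Float32Atom(),
                               shape=(0, num_channels),
                               filters=tables.Filters(complevel=1))
        while True:
            chunk = np.fromfile(f, dtype=np.float32,
                                count=chunk_rows * num_channels)
            if chunk.size == 0:
                break
            # Drop any trailing partial sample before reshaping.
            chunk = chunk[:chunk.size // num_channels * num_channels]
            arr.append(chunk.reshape(-1, num_channels))

    # Zooming is then strided/sliced indexing; only the touched chunks are read.
    with tables.open_file('huge_file.h5', 'r') as h5:
        preview = h5.root.data[::1000]          # coarse overview
        roi = h5.root.data[1000000:3900000]     # ~100 MB region en bloc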
participants (4):
- Daniele Nicolodi
- Julian Taylor
- RayS
- Sturla Molden