[Numpy-discussion] fast numpy.fromfile skipping data chunks

Wed Mar 13 12:53:07 EDT 2013

> Since the files are huge, and would make me run out of memory, I need to
> read data skipping some records

Is it possible to describe what you're doing with the data once you have
subsampled it? And if there were a way to work with the full resolution
data, would that be desirable?

I ask because I've been dabbling with a pure-Python library for handling
larger-than-memory datasets (https://github.com/SciTools/biggus). It uses
chunking techniques similar to those mentioned in the other replies to
process data at the full streaming I/O rate. It's still in the early stages
of development, so the design is fluid, and it may be worth seeing whether
there's enough in common with your needs to warrant adding your use case.
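
For reference, here is a minimal sketch of the kind of chunked reading
mentioned in the other replies. It is not biggus's API. The function name
and its parameters are hypothetical, it assumes a flat file of fixed-size
records of a single dtype, and `step` counts records (keep one in every
`step`) rather than the byte `skip` used in the original code.

```python
import numpy as np

def read_subsampled(path, dtype=np.uint32, step=10, chunk_records=1_000_000):
    """Return every `step`-th record of a flat binary file, reading in chunks.

    Hypothetical helper: only one chunk plus the kept samples are held in
    memory at any time.
    """
    dtype = np.dtype(dtype)
    pieces = []
    offset = 0  # index, within the next chunk, of the first record to keep
    with open(path, "rb") as fid:
        while True:
            # One large sequential read instead of one read per record.
            chunk = np.fromfile(fid, dtype=dtype, count=chunk_records)
            if chunk.size == 0:
                break  # end of file
            # Copy the kept samples so the full chunk can be freed.
            pieces.append(chunk[offset::step].copy())
            # Carry the stride phase across the chunk boundary.
            offset = (offset - chunk.size) % step
    if pieces:
        return np.concatenate(pieces)
    return np.empty(0, dtype=dtype)
```

Because the dtype is an ordinary argument, the same function covers other
record types, e.g. read_subsampled("data.bin", dtype=np.float64, step=100),
where "data.bin" is a placeholder file name.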

Richard

On 13 March 2013 13:45, Andrea Cimatoribus <Andrea.Cimatoribus at nioz.nl> wrote:

> Hi everybody, I hope this has not been discussed before, I couldn't find a
> solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do this.
> Since the files are huge, and would make me run out of memory, I need to
> read data skipping some records (I am reading data recorded at high
> frequency, so basically I want to subsample while reading).
> At the moment, I have come up with the code below, which is then compiled
> using Cython. Despite the significant performance increase over the pure
> Python version, the function is still much slower than numpy.fromfile, and
> it only reads one kind of data (in this case uint32), because I do not know
> how else to define the array type in advance. I have basically no
> experience with Cython or C, so I am a bit stuck. How can I make this more
> efficient and possibly more generic?
> Thanks
>
> import numpy as np
> #For cython!
> cimport numpy as np
> from libc.stdint cimport uint32_t
>
> def cffskip32(fid, int count=1, int skip=0):
>
>     cdef int k=0
>     cdef np.ndarray[uint32_t, ndim=1] data = np.zeros(count, dtype=np.uint32)
>
>     if skip>=0:
>         while k<count:
>             try:
>                 data[k] = np.fromfile(fid, count=1, dtype=np.uint32)
>                 fid.seek(skip, 1)
>                 k +=1
>             except ValueError:
>                 data = data[:k]
>                 break
>         return data