how to pipe into numpy arrays?
As numpy.fromfile seems to require full file object functionality like seek, I cannot use it with the sys.stdin pipe. So how could I stream a binary pipe directly into numpy? I can imagine storing the data in a string and using StringIO, but the files are 3.6 GB large, just the binary, and that will most likely be much more as a string object. Reading binary files on disk is NOT the problem; I would like to avoid the temporary file if possible.
On Wed, Oct 24, 2012 at 3:00 PM, Michael Aye wrote:
> As numpy.fromfile seems to require full file object functionality like seek, I cannot use it with the sys.stdin pipe. So how could I stream a binary pipe directly into numpy? I can imagine storing the data in a string and using StringIO, but the files are 3.6 GB large, just the binary, and that will most likely be much more as a string object. Reading binary files on disk is NOT the problem; I would like to avoid the temporary file if possible.
I haven't tried this myself, but there is a numpy.frombuffer() function as well. Maybe that could be used here?
Cheers!
Ben Root
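A minimal sketch of the frombuffer idea, assuming Python 3 (where the binary side of the pipe would be sys.stdin.buffer) and assuming the data are raw float64 values. The function name array_from_stream and the BytesIO stand-in for the pipe are made up for illustration. Note that frombuffer needs the complete bytes object first, so the whole stream is held in memory once, and the resulting array is read-only because it shares memory with those bytes.

```python
import io

import numpy as np


def array_from_stream(stream, dtype=np.float64):
    """Read a binary stream to EOF and view the bytes as a 1-D array.

    The array shares memory with the bytes object (no second copy),
    which also makes it read-only.
    """
    data = stream.read()  # reads until EOF; no seek() is required
    return np.frombuffer(data, dtype=dtype)


# Stand-in for a pipe so the sketch runs on its own;
# in a real script this would be sys.stdin.buffer (assumption).
fake_pipe = io.BytesIO(np.arange(5, dtype=np.float64).tobytes())
arr = array_from_stream(fake_pipe)
```

If a writable array is needed, arr.copy() would give one, at the cost of a second copy of the data.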
On 10/24/2012 09:00 PM, Michael Aye wrote:
> As numpy.fromfile seems to require full file object functionality like seek, I cannot use it with the sys.stdin pipe. So how could I stream a binary pipe directly into numpy? I can imagine storing the data in a string and using StringIO, but the files are 3.6 GB large, just the binary, and that will most likely be much more as a string object.
A Python 2 string is just a bytes object and would take 3.6 GB as well (or did you mean in text encoding?)
> Reading binary files on disk is NOT the problem; I would like to avoid the temporary file if possible.
Read in chunks? Something like:
1) Create array arr
2) Get a flat byte view of it:
    arr_bytes = arr.view(np.uint8).reshape(-1)
    # check that modifying arr_bytes modifies arr,
    # if not, work with reshape arguments
3) Fill it chunk by chunk:
    while not done:
        chunk = f.read(chunk_size)
        arr_bytes[i:i + len(chunk)] = np.frombuffer(chunk, dtype=np.uint8)
        ...
Alternatively, one could write some C or Cython code to read directly into the NumPy array buffer, which avoids an extra copy of the data over the memory bus. (Unfortunately, it doesn't look like "fromfile" has an out argument.)
Dag Sverre
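The chunked approach above could be sketched as follows. This is only an illustration, assuming Python 3, a preallocated C-contiguous array whose total size is known in advance, and a default chunk size of 1 MiB; the name read_into_array and the BytesIO stand-in for the pipe are hypothetical.

```python
import io

import numpy as np


def read_into_array(stream, arr, chunk_size=1 << 20):
    """Fill a preallocated, C-contiguous array from a binary stream in chunks."""
    # Flat byte view that shares memory with arr (true for contiguous arrays).
    arr_bytes = arr.reshape(-1).view(np.uint8)
    i = 0
    while i < arr_bytes.size:
        # Never ask for more than what is still missing.
        chunk = stream.read(min(chunk_size, arr_bytes.size - i))
        if not chunk:
            raise EOFError("stream ended after %d of %d bytes" % (i, arr_bytes.size))
        # A pipe may return fewer bytes than requested, so use len(chunk).
        arr_bytes[i:i + len(chunk)] = np.frombuffer(chunk, dtype=np.uint8)
        i += len(chunk)
    return arr


# Stand-in for sys.stdin.buffer so the sketch runs on its own.
source = np.arange(12, dtype=np.int32).reshape(3, 4)
fake_pipe = io.BytesIO(source.tobytes())
out = np.empty((3, 4), dtype=np.int32)
read_into_array(fake_pipe, out, chunk_size=5)  # tiny chunks, just to exercise the loop
```

Because the copy happens on the byte view, the chunk boundaries do not have to line up with the item size of the array.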
On 10/25/2012 08:17 AM, Dag Sverre Seljebotn wrote:
> On 10/24/2012 09:00 PM, Michael Aye wrote:
>> As numpy.fromfile seems to require full file object functionality like seek, I cannot use it with the sys.stdin pipe. So how could I stream a binary pipe directly into numpy? I can imagine storing the data in a string and using StringIO, but the files are 3.6 GB large, just the binary, and that will most likely be much more as a string object.
> A Python 2 string is just a bytes object and would take 3.6 GB as well (or did you mean in text encoding?)
>> Reading binary files on disk is NOT the problem; I would like to avoid the temporary file if possible.
> Read in chunks? Something like:
> 1) Create array arr
> 2) Get a flat byte view of it:
>     arr_bytes = arr.view(np.uint8).reshape(-1)
>     # check that modifying arr_bytes modifies arr,
>     # if not, work with reshape arguments
> 3) Fill it chunk by chunk:
>     while not done:
>         chunk = f.read(chunk_size)
>         arr_bytes[i:i + len(chunk)] = np.frombuffer(chunk, dtype=np.uint8)
>         ...
> Alternatively, one could write some C or Cython code to read directly into the NumPy array buffer, which avoids an extra copy of the data over the memory bus. (Unfortunately, it doesn't look like "fromfile" has an out argument.)
Actually, as long as you make sure chunk_size is on the order of 1 MB or so, the Python overhead may not matter, and the chunks fit in cache, so the extra trip over the memory bus is avoided. A C solution may therefore be overkill.
Dag Sverre
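For comparison, a variant that is not mentioned in this thread: in Python 3, binary file objects provide readinto(), which writes into any writable buffer, so the data can land in the array's memory without C or Cython. The sketch below assumes a preallocated C-contiguous array; the name readinto_array and the BytesIO stand-in for sys.stdin.buffer are hypothetical.

```python
import io

import numpy as np


def readinto_array(stream, arr, chunk_size=1 << 20):
    """Fill a preallocated, C-contiguous array using stream.readinto()."""
    # Writable byte-level memoryview over arr's own memory.
    buf = memoryview(arr.reshape(-1).view(np.uint8))
    i = 0
    while i < len(buf):
        # readinto() returns how many bytes it actually wrote (0 at EOF).
        n = stream.readinto(buf[i:i + chunk_size])
        if not n:
            raise EOFError("stream ended after %d of %d bytes" % (i, len(buf)))
        i += n
    return arr


# Stand-in for sys.stdin.buffer so the sketch runs on its own.
source2 = np.linspace(0.0, 1.0, 7)
out2 = np.empty(7, dtype=np.float64)
readinto_array(io.BytesIO(source2.tobytes()), out2, chunk_size=16)
```

Whether this is measurably faster than copying 1 MB chunks would have to be checked on real data.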
participants (3)
- Benjamin Root
- Dag Sverre Seljebotn
- Michael Aye