reading gzip compressed files using numpy.fromfile - NumPy-Discussion

Oct. 29, 2009


      ...
Date: Wed, 28 Oct 2009 20:31:43 +0100
From: Peter Schmidtke <pschmidtke@mmb.pcb.ub.es>
Subject: [Numpy-discussion] reading gzip compressed files using
  numpy.fromfile
To: numpy-discussion@scipy.org
Message-ID: <fc345224bfa26132e9474287e32e083b@mmb.pcb.ub.es>
Content-Type: text/plain; charset="UTF-8"
Dear Numpy Mailing List Readers,
I have a quite simple problem for which I have not found a solution so far.
I have a gzipped file lying around that has some numbers stored in it, and I
want to read them into a numpy array as fast as possible, but only a bunch
of data at a time.
So I would like to use numpy's fromfile function.
For now I have roughly the following code:
f=gzip.open( "myfile.gz", "r" )
xyz=npy.fromfile(f,dtype="float32",count=400)
So I would read 400 entries from the file, keep it open, process my data,
come back and read the next 400 entries. If I do this, numpy complains
that the file handle f is not a normal file handle:
IOError: first argument must be an open file
but in fact it is a zlib file handle. But gzip gives access to the normal
filehandle through f.fileobj.
So I tried  xyz=npy.fromfile(f.fileobj,dtype="float32",count=400)
But there I get just meaningless values (not the actual data) and when I
specify the sep=" " argument for npy.fromfile I get just .1 and nothing
else.
Can you tell me why and how to fix this problem? I know that I could read
everything to memory, but these files are rather big, so I simply have to
avoid this.
Thanks in advance.
--
Peter Schmidtke
----------------------
PhD Student at the Molecular Modeling and Bioinformatics Group
Dep. Physical Chemistry
Faculty of Pharmacy
University of Barcelona
------------------------------
Message: 2
Date: Wed, 28 Oct 2009 14:33:11 -0500
From: Robert Kern <robert.kern@gmail.com>
Subject: Re: [Numpy-discussion] reading gzip compressed files using
  numpy.fromfile
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Message-ID:
  <3d375d730910281233r5cadd0fcubea14676a3a978f1@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Wed, Oct 28, 2009 at 14:31, Peter Schmidtke <pschmidtke@mmb.pcb.ub.es>
wrote:
> Dear Numpy Mailing List Readers,
> I have a quite simple problem for which I have not found a solution so far.
> I have a gzipped file lying around that has some numbers stored in it, and I
> want to read them into a numpy array as fast as possible, but only a bunch
> of data at a time.
> So I would like to use numpy's fromfile function.
> For now I have roughly the following code:
> f=gzip.open( "myfile.gz", "r" )
> xyz=npy.fromfile(f,dtype="float32",count=400)
> So I would read 400 entries from the file, keep it open, process my data,
> come back and read the next 400 entries. If I do this, numpy complains
> that the file handle f is not a normal file handle:
> IOError: first argument must be an open file
> but in fact it is a zlib file handle. But gzip gives access to the normal
> filehandle through f.fileobj.
np.fromfile() requires a true file object, not just a file-like
object. np.fromfile() works by grabbing the FILE* pointer underneath
and using C system calls to read the data, not by calling the .read()
method.
> So I tried xyz=npy.fromfile(f.fileobj,dtype="float32",count=400)
> But there I get just meaningless values (not the actual data) and when I
> specify the sep=" " argument for npy.fromfile I get just .1 and nothing
> else.
This is reading the compressed data, not the data that you want.
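To illustrate the point, here is a minimal sketch that compares the raw bytes on disk with what gzip hands back after decompression. The temporary file and the sample values are assumptions made up for this example; they are not from the original post.

```python
import gzip
import os
import tempfile

import numpy as np

# Hypothetical sample data: 1000 float32 values written through gzip.
data = np.arange(1000, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "myfile.gz")
with gzip.open(path, "wb") as f:
    f.write(data.tobytes())

# Opening the file directly gives the compressed stream. It begins with
# the gzip magic bytes 0x1f 0x8b, not with float32 values.
with open(path, "rb") as raw:
    header = raw.read(2)

# Going through gzip gives the decompressed payload instead.
with gzip.open(path, "rb") as f:
    payload = f.read(8)  # two float32 values, 4 bytes each

first_two = np.frombuffer(payload, dtype=np.float32)
```

Interpreting the compressed stream as float32 is what produces the meaningless numbers; only bytes that have passed through the gzip layer correspond to the stored values.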
> Can you tell me why and how to fix this problem? I know that I could read
> everything to memory, but these files are rather big, so I simply have to
> avoid this.
Read in reasonably-sized chunks of bytes at a time, and use
np.fromstring() to create arrays from them.
-- 
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
------------------------------
Message: 3
Date: Wed, 28 Oct 2009 13:26:41 -0700
From: Christopher Barker <Chris.Barker@noaa.gov>
Subject: Re: [Numpy-discussion] reading gzip compressed files using
  numpy.fromfile
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Message-ID: <4AE8A901.3060403@noaa.gov>
Content-Type: text/plain; charset=UTF-8; format=flowed
Robert Kern wrote:
>> f=gzip.open( "myfile.gz", "r" )
>> xyz=npy.fromfile(f,dtype="float32",count=400)
>
> Read in reasonably-sized chunks of bytes at a time, and use
> np.fromstring() to create arrays from them.
Something like:
count = 400
xyz = np.fromstring(f.read(count*4), dtype=np.float32)
should work (untested...)
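Expanded into a loop, the same idea might look like the sketch below. It uses np.frombuffer rather than np.fromstring, because the binary mode of np.fromstring has since been deprecated in NumPy. The helper name iter_chunks, the temporary file, and the sample values are assumptions for illustration only.

```python
import gzip
import os
import tempfile

import numpy as np


def iter_chunks(path, count=400, dtype=np.float32):
    """Yield arrays of up to `count` items read from a gzip file."""
    itemsize = np.dtype(dtype).itemsize
    with gzip.open(path, "rb") as f:
        while True:
            buf = f.read(count * itemsize)
            # Keep only whole items; a trailing partial item is dropped.
            usable = len(buf) - len(buf) % itemsize
            if usable == 0:
                break
            yield np.frombuffer(buf[:usable], dtype=dtype)


# Hypothetical sample file holding 1000 float32 values.
data = np.arange(1000, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "myfile.gz")
with gzip.open(path, "wb") as f:
    f.write(data.tobytes())

# Only one chunk's worth of decompressed bytes is held at a time
# inside the generator; the list here is just for demonstration.
chunks = list(iter_chunks(path, count=400))
```

Arrays returned by np.frombuffer on a bytes object are read-only views of that buffer, so call .copy() on a chunk if it needs to be modified in place.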
-Chris
-- 
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
Thanks Robert and Chris...indeed I managed to read it quite fast this way.

++


Peter Schmidtke

----------------------
PhD Student at the Molecular Modeling and Bioinformatics Group
Dep. Physical Chemistry
Faculty of Pharmacy
University of Barcelona
