[Numpy-discussion] About the npz format

Valentin Haenel valentin at haenel.co
Thu Apr 17 16:56:27 EDT 2014


* Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> Hi,
> 
> * Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-17]:
> > On 17.04.2014 21:30, onefire wrote:
> > > Hi Nathaniel,
> > > 
> > > Thanks for the suggestion. I did profile the program before, just not
> > > using Python.
> > 
> > one problem of npz is that the zipfile module does not support streaming
> > data in (or if it does now we aren't using it).
> > So numpy writes the file uncompressed to disk and then zips it which is
> > horrible for performance and disk usage.
> 
> As a workaround it may also be possible to write the temporary NPY files to
> cStringIO instances and then use ``ZipFile.writestr`` with the
> ``getvalue()`` of the cStringIO object. However, that approach may
> require some memory. In Python 2.7, for each array: one copy inside the
> cStringIO instance and then another copy when calling ``getvalue()`` on
> the cStringIO object, I believe.

There is a proof-of-concept implementation here:

https://github.com/esc/numpy/compare/feature;npz_no_temp_file
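
For anyone who doesn't want to read the diff, the idea can be sketched
roughly like this. This is a minimal sketch and not the code on the
branch: ``savez_in_memory`` is a hypothetical name, and it uses
``io.BytesIO`` as the Python 3 stand-in for cStringIO.

```python
import io
import zipfile

import numpy as np


def savez_in_memory(file, arrays, compress=False):
    """Write arrays to an NPZ archive without temporary NPY files on disk.

    Hypothetical helper for illustration only. `file` is a path or a
    file-like object; `arrays` maps member names (without the ".npy"
    suffix) to ndarrays.
    """
    compression = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(file, mode="w", compression=compression,
                         allowZip64=True) as zf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            # Serialize the array in NPY format into memory
            # (first in-memory copy of the data).
            np.save(buf, np.asanyarray(arr), allow_pickle=False)
            # getvalue() returns the buffer contents as bytes
            # (second copy), which writestr() puts into the archive.
            zf.writestr(name + ".npy", buf.getvalue())
```

Something like ``savez_in_memory("x.npz", {"arr_0": x})`` should then give
a file that ``np.load`` can read back. The two copies per array mentioned
above are visible in the comments.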

Here are the timings, again using ``sync()`` from bloscpack (but it's
just an ``os.system('sync')``, in case you want to run your own
benchmarks):

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 50000000)

In [4]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 1.93 s per loop

In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.88 s per loop

In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
1 loops, best of 3: 3.22 s per loop

Not too bad, but still slower than plain NPY. Memory copies would be my
guess.
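
If you want to repeat the measurement without IPython or bloscpack,
something along these lines should do. It is only a sketch: ``sync``,
``best_of`` and ``bench`` are names made up here, the ``do_sync`` flag is
an addition for convenience, and the ``sync`` command assumes a Unix-like
system. It covers only ``np.save`` and ``np.savez``, since
``np._savez_no_temp`` exists only on the branch.

```python
import os
import tempfile
import timeit

import numpy as np


def sync():
    # Stand-in for bloscpack's sync(): ask the OS to flush write buffers.
    os.system("sync")


def best_of(func, repeat=3):
    # Like %timeit's "best of 3": run func once per repetition, keep the minimum.
    return min(timeit.repeat(func, number=1, repeat=repeat))


def bench(n=50000000, repeat=3, do_sync=True):
    """Time np.save and np.savez on a linspace array of n float64 values."""
    x = np.linspace(1, 10, n)
    flush = sync if do_sync else (lambda: None)
    with tempfile.TemporaryDirectory() as tmp:
        npy = os.path.join(tmp, "x.npy")
        npz = os.path.join(tmp, "x.npz")
        return {
            "save": best_of(lambda: (np.save(npy, x), flush()), repeat),
            "savez": best_of(lambda: (np.savez(npz, x), flush()), repeat),
        }
```

Calling ``bench()`` returns the best time in seconds for each function.
The numbers will of course depend on the disk and on the NumPy version.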

V-

PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master


