* Valentin Haenel
Hi,
* Julian Taylor
[2014-04-17]: On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem with npz is that the zipfile module does not support streaming data in (or if it does now, we aren't using it). So numpy writes the file uncompressed to disk and then zips it, which is horrible for performance and disk usage.
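For illustration, a rough sketch of the two-step behaviour described above (this is not NumPy's actual implementation, and the function and archive member names are made up for the example):

```python
import os
import tempfile
import zipfile

import numpy as np


def savez_via_tempfile(zip_path, arr):
    # Sketch of the problem: the array is first written uncompressed to a
    # temporary .npy file on disk, and only then copied into the zip
    # archive -- doubling both the I/O and the peak disk usage.
    fd, tmp_path = tempfile.mkstemp(suffix='.npy')
    os.close(fd)
    try:
        np.save(tmp_path, arr)               # first write: temp file on disk
        with zipfile.ZipFile(zip_path, 'w') as zf:
            zf.write(tmp_path, 'arr_0.npy')  # second write: into the archive
    finally:
        os.remove(tmp_path)
```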
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then pass the ``getvalue()`` of each cStringIO object to ``ZipFile.writestr``. However, that approach may require extra memory: in Python 2.7, for each array there would be one copy inside the cStringIO instance and then another copy made when calling ``getvalue()``, I believe.
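A minimal sketch of that workaround, using ``io.BytesIO`` (the Python 3 equivalent of cStringIO); the function and archive member names are invented for the example:

```python
import io
import zipfile

import numpy as np


def savez_in_memory(zip_path, arrays):
    # Workaround sketch: serialize each array into an in-memory buffer and
    # hand the bytes to ZipFile.writestr, avoiding the temporary file on
    # disk at the cost of in-memory copies.
    with zipfile.ZipFile(zip_path, 'w') as zf:
        for i, arr in enumerate(arrays):
            buf = io.BytesIO()
            np.save(buf, arr)                 # copy 1: inside the buffer
            zf.writestr('arr_%d.npy' % i,
                        buf.getvalue())       # copy 2: getvalue()
```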
There is a proof-of-concept implementation here: https://github.com/esc/numpy/compare/feature;npz_no_temp_file

Here are the timings, again using ``sync()`` from bloscpack (but it's just an ``os.system('sync')``, in case you want to run your own benchmarks):

In [1]: import numpy as np
In [2]: import bloscpack.sysutil as bps
In [3]: x = np.linspace(1, 10, 50000000)
In [4]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 1.93 s per loop
In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.88 s per loop
In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
1 loops, best of 3: 3.22 s per loop

Not too bad, but still slower than plain NPY; memory copies would be my guess.

V-

PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
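For reference, the session above can be reproduced outside IPython with a plain ``timeit`` harness. This is only a sketch: the array is much smaller here than the 50000000-element one used in the original timings, and ``_savez_no_temp`` only exists on the proof-of-concept branch, so it is omitted.

```python
import os
import timeit

import numpy as np

# Smaller than the original benchmark array, so absolute times will differ.
x = np.linspace(1, 10, 1000000)


def sync():
    # bloscpack.sysutil.sync() boils down to this shell call (POSIX only)
    os.system('sync')


t_npy = timeit.timeit(lambda: (np.save('x.npy', x), sync()), number=1)
t_npz = timeit.timeit(lambda: (np.savez('x.npz', x), sync()), number=1)
print('npy: %.3f s, npz: %.3f s' % (t_npy, t_npz))
```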