[Numpy-discussion] About the npz format

Valentin Haenel valentin at haenel.co
Thu Apr 17 16:01:04 EDT 2014


Hi again,

* David Palao <dpalao.python at gmail.com> [2014-04-17]:
> 2014-04-16 20:26 GMT+02:00 onefire <onefire.myself at gmail.com>:
> > Hi all,
> >
> > I have been playing with the idea of using Numpy's binary format as a
> > lightweight alternative to HDF5 (which I believe is the "right" way to do if
> > one does not have a problem with the dependency).
> >
> > I am pretty happy with the npy format, but the npz format seems to be broken
> > as far as performance is concerned (or I am missing something obvious!). The following
> > ipython session illustrates the issue:
> >
> > In [1]: import numpy as np
> >
> > In [2]: x = np.linspace(1, 10, 50000000)
> >
> > In [3]: %time np.save("x.npy", x)
> > CPU times: user 40 ms, sys: 230 ms, total: 270 ms
> > Wall time: 488 ms
> >
> > In [4]: %time np.savez("x.npz", data = x)
> > CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
> > Wall time: 7.7 s
> >
> 
> Hi,
> In my case (python-2.7.3, numpy-1.6.1):
> 
> In [23]: %time save("xx.npy", x)
> CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
> Wall time: 4.07 s
> 
> In [24]: %time savez("xx.npz", data = x)
> CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
> Wall time: 4.26 s
> 
> In my case I don't see the "unbelievable amount of overhead" of the npz thing.

When profiling IO operations, many factors can influence the
measurements. In my experience on Linux, these include the filesystem
cache, the CPU governor, the system load, power-saving features (e.g.
laptop-mode tools), the type of hard drive and how it is connected, and
any cron jobs that might be running (e.g. updating the locate DB).

So, for example, when measuring the time it takes to write something to
disk on Linux, I always include at least a call to ``sync``, which
ensures that all kernel filesystem buffers are written to disk. Even
then, you may still see a lot of variability.
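
A minimal sketch of timing a write together with the flush, assuming
Python 3.3+ on a Unix system where the standard library exposes
``os.sync()``. The helper names ``timed_with_sync`` and ``write_payload``
are hypothetical and not part of numpy or bloscpack; the payload writer
is only a stand-in for the call being measured:

```python
import os
import tempfile
import time


def timed_with_sync(write_fn):
    """Return wall-clock seconds for write_fn() plus flushing kernel buffers."""
    start = time.perf_counter()
    write_fn()
    # os.sync() exists on Unix (Python 3.3+); skip it where unavailable.
    if hasattr(os, "sync"):
        os.sync()
    return time.perf_counter() - start


def write_payload(path, size=1 << 20):
    # Hypothetical stand-in for np.save(): writes `size` zero bytes.
    with open(path, "wb") as fh:
        fh.write(b"\0" * size)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "payload.bin")
    elapsed = timed_with_sync(lambda: write_payload(target))
    written = os.path.getsize(target)
    print("wrote %d bytes in %.3f s" % (written, elapsed))
```

To time the numpy calls instead, pass something like
``lambda: np.save("x.npy", x)`` as ``write_fn``. Note that ``sync``
flushes everything that is dirty system-wide, so other processes
writing at the same time will inflate the reading.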

As part of bloscpack.sysutil I have wrapped this to be available from
Python (needs root though). So, to re-run the benchmarks, doing each
one twice:

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 50000000)

In [4]: %time np.save("x.npy", x)
CPU times: user 12 ms, sys: 356 ms, total: 368 ms
Wall time: 1.41 s

In [5]: %time np.save("x.npy", x)
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 811 ms

In [6]: %time np.savez("x.npz", data = x)
CPU times: user 540 ms, sys: 864 ms, total: 1.4 s
Wall time: 4.74 s

In [7]: %time np.savez("x.npz", data = x)
CPU times: user 580 ms, sys: 808 ms, total: 1.39 s
Wall time: 9.47 s

In [8]: bps.sync()

In [9]: %time np.save("x.npy", x) ; bps.sync()
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 2.2 s

In [10]: %time np.save("x.npy", x) ; bps.sync()
CPU times: user 0 ns, sys: 356 ms, total: 356 ms
Wall time: 2.16 s

In [11]: bps.sync()

In [12]: %time np.savez("x.npz", x) ; bps.sync()
CPU times: user 564 ms, sys: 816 ms, total: 1.38 s
Wall time: 8.21 s

In [13]: %time np.savez("x.npz", x) ; bps.sync()
CPU times: user 588 ms, sys: 772 ms, total: 1.36 s
Wall time: 6.83 s

As you can see, even when using ``sync`` the values might vary, so in
addition it might be worth using %timeit, which in its default setting
runs the statement at least three times and reports the best result:

In [14]: %timeit np.save("x.npy", x)
1 loops, best of 3: 2.4 s per loop

In [15]: %timeit np.savez("x.npz", x)
1 loops, best of 3: 7.1 s per loop

In [16]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 3.11 s per loop

In [17]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.36 s per loop
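
Outside of IPython, the same best-of-three idea can be sketched with the
standard library ``timeit`` module. The names ``best_of`` and
``workload`` below are hypothetical, and the workload is only a
placeholder for the ``np.save`` / ``np.savez`` call being measured:

```python
import timeit


def best_of(fn, repeat=3, number=1):
    """Run `repeat` trials of `number` calls each; return (best, all timings)."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings), timings


def workload():
    # Hypothetical stand-in for the save call being measured.
    sum(range(100000))


best, timings = best_of(workload)
spread = max(timings) - best
print("best of %d: %.6f s (spread %.6f s)" % (len(timings), best, spread))
```

Reporting the spread next to the best value gives a rough idea of how
noisy the measurement was, which matters for IO-bound timings like the
ones above.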

So, anyway, given these readings, I would tend to support the claim
that something is slowing down writes when using plain NPZ without
compression.

FYI: when reading, the kernel keeps recently read files in the
filesystem buffers, so when measuring reads I tend to drop those caches
using ``drop_caches()`` from bloscpack.sysutil (which wraps the Linux
proc filesystem interface).
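
For readers without bloscpack, a rough sketch of what such a helper
might look like, assuming the standard Linux interface at
``/proc/sys/vm/drop_caches`` (writing to it needs root). This is not
bloscpack's actual implementation, and the ``path`` parameter is a
hypothetical addition so the demo can run against a scratch file
without root:

```python
import os
import tempfile


def drop_caches(mode=3, path="/proc/sys/vm/drop_caches"):
    """Ask the Linux kernel to drop clean caches (needs root for the real path).

    mode 1 drops the page cache, 2 drops dentries and inodes, 3 drops both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    # Dirty pages cannot be dropped, so flush them to disk first.
    if hasattr(os, "sync"):
        os.sync()
    with open(path, "w") as fh:
        fh.write("%d\n" % mode)


# Demo against a scratch file instead of the real proc entry.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "drop_caches")
    drop_caches(3, path=fake)
    with open(fake) as fh:
        content = fh.read()
    print(repr(content))
```

When run as root with the default ``path``, calling ``drop_caches()``
before each timed read should make the read come from disk rather than
from memory.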

best,

V-


