On 18/04/14 13:01, Valentin Haenel wrote:
Hi again,
* onefire <onefire.myself@gmail.com> [2014-04-18]:
I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. That would make it possible to, say, let the user choose the checksum algorithm or turn it off entirely. Or maybe the compression support makes this route too complicated to be worth the trouble? (After all, the zip format is not that hard to understand.)

Just to give you an idea of what my aforementioned Bloscpack library can do in the linspace case (a round-trip sketch follows the session below):
In [1]: import numpy as np
In [2]: import bloscpack as bp
In [3]: import bloscpack.sysutil as bps
In [4]: x = np.linspace(1, 10, 50000000)
In [5]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 2.12 s per loop

In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
1 loops, best of 3: 627 ms per loop

In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 1.92 s per loop

In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
3 loops, best of 3: 564 ms per loop

In [9]: ls -lah x.npy x.blp
-rw-r--r-- 1 root root  49M Apr 18 12:53 x.blp
-rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy
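By the way, reading the data back is the single inverse call. A minimal round-trip sketch, assuming the pack_ndarray_file/unpack_ndarray_file pair used in the session (exact signatures may differ between Bloscpack versions):

    import numpy as np
    import bloscpack as bp

    x = np.linspace(1, 10, 50000000)
    bp.pack_ndarray_file(x, 'x.blp')     # compress and write
    y = bp.unpack_ndarray_file('x.blp')  # read and decompress
    assert (x == y).all()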
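Note that the bps.sync() calls above flush the OS write buffers, so the timings include the data actually reaching the disk rather than just the page cache. Outside IPython, the equivalent measurement could be sketched like this (assuming bloscpack.sysutil.sync wraps the system-level sync):

    import time
    import numpy as np
    import bloscpack as bp
    import bloscpack.sysutil as bps

    x = np.linspace(1, 10, 50000000)
    start = time.time()
    bp.pack_ndarray_file(x, 'x.blp')
    bps.sync()  # ensure buffered writes hit the disk before stopping the clock
    print("pack + sync: %.2f s" % (time.time() - start))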
However, this is a bit of a special case, since Blosc does extremely well -- both speed- and size-wise -- on the linspace data; your mileage may vary.
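To see the other extreme, one can feed Blosc essentially incompressible input. A hypothetical sketch (uniform random data instead of linspace; the file names are made up for illustration):

    import os
    import numpy as np
    import bloscpack as bp

    smooth = np.linspace(1, 10, 50000000)  # highly compressible
    noisy = np.random.random(50000000)     # nearly incompressible

    bp.pack_ndarray_file(smooth, 'smooth.blp')
    bp.pack_ndarray_file(noisy, 'noisy.blp')

    # Expect 'noisy.blp' to stay close to the raw ~400 MB, while
    # 'smooth.blp' comes out far smaller.
    for fname in ('smooth.blp', 'noisy.blp'):
        print(fname, os.path.getsize(fname))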
Exactly, and besides, Blosc can use different codecs inside it. Just for completeness, here is a small benchmark of what you can expect from them (my laptop does not have an SSD, so my figures are a bit slower than Valentin's):

In [50]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 5.7 s per loop

In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS

In [52]: cargs['cname'] = 'blosclz'

In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.12 s per loop

In [54]: cargs['cname'] = 'lz4'

In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 985 ms per loop

In [56]: cargs['cname'] = 'lz4hc'

In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.95 s per loop

In [58]: cargs['cname'] = 'snappy'

In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.11 s per loop

In [60]: cargs['cname'] = 'zlib'

In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 3.12 s per loop

So all the codecs make storage faster than a plain np.save(), most especially blosclz, lz4 and snappy. However, lz4hc and zlib achieve the best compression ratios:

In [62]: ls -lht x*.*
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp
-rw-r--r-- 1 faltet users  54M 18 abr 13:48 x-snappy.blp
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp
-rw-r--r-- 1 faltet users  48M 18 abr 13:47 x-lz4.blp
-rw-r--r-- 1 faltet users  49M 18 abr 13:47 x-blosclz.blp
-rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy

But again, we are talking about an especially compression-friendly case.

-- Francesc Alted
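As a footnote, the codec sweep above is easy to script. A minimal sketch, reusing the DEFAULT_BLOSC_ARGS dict from the session (and assuming x is the same linspace array):

    import numpy as np
    import bloscpack as bp

    x = np.linspace(1, 10, 50000000)

    # Run every codec benchmarked above, starting each run from the
    # default Blosc settings.
    for cname in ('blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib'):
        cargs = bp.args.DEFAULT_BLOSC_ARGS.copy()
        cargs['cname'] = cname
        bp.pack_ndarray_file(x, 'x-%s.blp' % cname, blosc_args=cargs)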