[Numpy-discussion] About the npz format

Francesc Alted faltet at gmail.com
Fri Apr 18 08:03:00 EDT 2014


On 18/04/14 13:01, Valentin Haenel wrote:
> Hi again,
>
> * onefire <onefire.myself at gmail.com> [2014-04-18]:
>> I think your workaround might help, but a better solution would be to not
>> use Python's zipfile module at all. This would make it possible to, say,
>> let the user choose the checksum algorithm or to turn that off.
>> Or maybe the compression stuff makes this route too complicated to be worth
>> the trouble? (after all, the zip format is not that hard to understand)
> Just to give you an idea of what my aforementioned Bloscpack library can
> do in the case of linspace:
>
> In [1]: import numpy as np
>
> In [2]: import bloscpack as bp
>
> In [3]: import bloscpack.sysutil as bps
>
> In [4]: x = np.linspace(1, 10, 50000000)
>
> In [5]: %timeit np.save("x.npy", x) ; bps.sync()
> 1 loops, best of 3: 2.12 s per loop
>
> In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
> 1 loops, best of 3: 627 ms per loop
>
> In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
> 3 loops, best of 3: 1.92 s per loop
>
> In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
> 3 loops, best of 3: 564 ms per loop
>
> In [9]: ls -lah x.npy x.blp
> -rw-r--r-- 1 root root  49M Apr 18 12:53 x.blp
> -rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy
>
> However, this is a bit of a special case, since Blosc does extremely well
> -- both speed- and size-wise -- on the linspace data; your mileage may
> vary.
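(A side note on the zipfile point above: just to show that a zip-free,
checksum-free multi-array container is not much code, here is a minimal
sketch. The helper names are made up and this is untested, so take it as
an illustration only.)

import struct

import numpy as np
from numpy.lib import format as npformat

def save_arrays_nozip(filename, **arrays):
    # hypothetical np.savez replacement: concatenated .npy records,
    # each prefixed with a length-delimited name; no zip directory,
    # no mandatory CRC-32 checksums
    with open(filename, 'wb') as fh:
        for name, arr in arrays.items():
            encoded = name.encode('utf-8')
            fh.write(struct.pack('<I', len(encoded)))
            fh.write(encoded)
            npformat.write_array(fh, arr)

def load_arrays_nozip(filename):
    # read records back until EOF
    arrays = {}
    with open(filename, 'rb') as fh:
        while True:
            prefix = fh.read(4)
            if not prefix:
                break
            (name_len,) = struct.unpack('<I', prefix)
            name = fh.read(name_len).decode('utf-8')
            arrays[name] = npformat.read_array(fh)
    return arrays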

Exactly, and besides, Blosc can use different codecs under the hood.  Just
for completeness, below is a small benchmark of what you can expect from
them (my laptop does not have an SSD, so my figures are a bit slower
than Valentin's).
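If you want to check which codecs your particular Blosc build ships, the
python-blosc bindings can tell you (a quick sketch; the exact list depends
on how Blosc was compiled):

import blosc

blosc.compressor_list()   # e.g. ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']

With that, here are the numbers: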

In [50]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 5.7 s per loop

In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS

In [52]: cargs['cname'] = 'blosclz'

In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.12 s per loop

In [54]: cargs['cname'] = 'lz4'

In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 985 ms per loop

In [56]: cargs['cname'] = 'lz4hc'

In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.95 s per loop

In [58]: cargs['cname'] = 'snappy'

In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.11 s per loop

In [60]: cargs['cname'] = 'zlib'

In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 3.12 s per loop

So all the codecs make the storage go faster than a plain np.save(),
especially blosclz, lz4 and snappy.  However, lz4hc and zlib achieve by
far the best compression ratios -- roughly 55x here, versus 7-8x for the
faster codecs:

In [62]: ls -lht x*.*
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp
-rw-r--r-- 1 faltet users  54M 18 abr 13:48 x-snappy.blp
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp
-rw-r--r-- 1 faltet users  48M 18 abr 13:47 x-lz4.blp
-rw-r--r-- 1 faltet users  49M 18 abr 13:47 x-blosclz.blp
-rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy
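For completeness, reading the data back is a one-liner (assuming the
unpacking counterpart in your Bloscpack version is called
unpack_ndarray_file; the name may differ between releases):

y = bp.unpack_ndarray_file('x-zlib.blp')
np.array_equal(x, y)   # -> True: Blosc compression is lossless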

But again, we are talking about an especially compressible case.
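To get a feel for how data-dependent these figures are, something along
these lines (an untested sketch using the python-blosc bindings directly)
compares linspace with essentially incompressible random data:

import blosc

r = np.random.rand(50000000)   # random doubles compress very poorly
# compressed size as a fraction of the ~400 MB raw buffer
len(blosc.compress(x.tobytes(), typesize=8)) / float(x.nbytes)  # tiny for linspace
len(blosc.compress(r.tobytes(), typesize=8)) / float(r.nbytes)  # close to 1 for random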

-- 
Francesc Alted



