[Numpy-discussion] About the npz format
Francesc Alted
faltet at gmail.com
Fri Apr 18 08:03:00 EDT 2014
On 18/04/14 13:01, Valentin Haenel wrote:
> Hi again,
>
> * onefire <onefire.myself at gmail.com> [2014-04-18]:
>> I think your workaround might help, but a better solution would be to not
>> use Python's zipfile module at all. This would make it possible to, say,
>> let the user choose the checksum algorithm or to turn that off.
>> Or maybe the compression stuff makes this route too complicated to be worth
>> the trouble? (after all, the zip format is not that hard to understand)
> Just to give you an idea of what my aforementioned Bloscpack library can
> do in the case of linspace:
>
> In [1]: import numpy as np
>
> In [2]: import bloscpack as bp
>
> In [3]: import bloscpack.sysutil as bps
>
> In [4]: x = np.linspace(1, 10, 50000000)
>
> In [5]: %timeit np.save("x.npy", x) ; bps.sync()
> 1 loops, best of 3: 2.12 s per loop
>
> In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
> 1 loops, best of 3: 627 ms per loop
>
> In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
> 3 loops, best of 3: 1.92 s per loop
>
> In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
> 3 loops, best of 3: 564 ms per loop
>
> In [9]: ls -lah x.npy x.blp
> -rw-r--r-- 1 root root 49M Apr 18 12:53 x.blp
> -rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy
>
> However, this is a bit of a special case, since Blosc does extremely
> well -- both speed- and size-wise -- on linspace data; your mileage
> may vary.
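One reason Blosc does so well here is its shuffle filter: before compressing, it regroups the bytes of each block by significance, so smooth data like linspace turns into long, highly compressible runs. A rough stdlib-only sketch of the effect, with plain zlib standing in for Blosc's codecs and `byte_shuffle` as a simplified hypothetical helper (not Blosc's actual implementation):

```python
import zlib

import numpy as np

def byte_shuffle(arr):
    """Simplified stand-in for Blosc's shuffle filter: transpose the
    N x itemsize byte matrix so that, e.g., all exponent bytes of a
    float64 array become contiguous."""
    itemsize = arr.dtype.itemsize
    raw = np.ascontiguousarray(arr).view(np.uint8)
    return raw.reshape(-1, itemsize).T.copy().tobytes()

x = np.linspace(1, 10, 100_000)

plain = len(zlib.compress(x.tobytes(), 6))
shuffled = len(zlib.compress(byte_shuffle(x), 6))
print(plain, shuffled)
```

On smooth data the shuffled stream compresses markedly better than the raw bytes, which is the effect Blosc exploits at much higher speed.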
Exactly, and besides, Blosc can use different codecs internally. Just
for completeness, here is a small benchmark of what you can expect from
them (my laptop does not have an SSD, so my figures are a bit slower
than Valentin's):
In [50]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 5.7 s per loop

In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS

In [52]: cargs['cname'] = 'blosclz'

In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.12 s per loop

In [54]: cargs['cname'] = 'lz4'

In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 985 ms per loop

In [56]: cargs['cname'] = 'lz4hc'

In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.95 s per loop

In [58]: cargs['cname'] = 'snappy'

In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.11 s per loop

In [60]: cargs['cname'] = 'zlib'

In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp',
blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 3.12 s per loop
So all the codecs can make storage faster than a pure np.save(),
especially blosclz, lz4 and snappy. However, lz4hc and zlib achieve the
best compression ratios:
In [62]: ls -lht x*.*
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp
-rw-r--r-- 1 faltet users 54M 18 abr 13:48 x-snappy.blp
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp
-rw-r--r-- 1 faltet users 48M 18 abr 13:47 x-lz4.blp
-rw-r--r-- 1 faltet users 49M 18 abr 13:47 x-blosclz.blp
-rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy
But again, we are talking about an especially nice compression case.
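To see how atypical it is, compare the same amount of smooth versus random data with plain stdlib zlib (only illustrative; Blosc's codecs behave analogously, just much faster):

```python
import zlib

import numpy as np

# Smooth, regularly spaced values leave plenty of byte-level redundancy
# for a codec to exploit; random doubles leave almost none, so the
# compression ratio collapses toward 1:1.
rng = np.random.default_rng(0)
smooth = np.linspace(1, 10, 100_000).tobytes()
noisy = rng.random(100_000).tobytes()

c_smooth = len(zlib.compress(smooth, 6))
c_noisy = len(zlib.compress(noisy, 6))
print(c_smooth, c_noisy)
```

On realistic scientific data the truth usually sits somewhere between these two extremes.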
--
Francesc Alted