On 18/04/14 13:01, Valentin Haenel wrote:
Hi again,
* onefire <onefire.myself@gmail.com> [2014-04-18]:
I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. That would make it possible to, say, let the user choose the checksum algorithm or turn it off entirely. Or maybe the compression support makes this route too complicated to be worth the trouble? (After all, the zip format is not that hard to understand.)

Just to give you an idea of what my aforementioned Bloscpack library can do in the linspace case (a round-trip sketch follows the session below):
In [1]: import numpy as np
In [2]: import bloscpack as bp
In [3]: import bloscpack.sysutil as bps
In [4]: x = np.linspace(1, 10, 50000000)
In [5]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 2.12 s per loop

In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
1 loops, best of 3: 627 ms per loop

In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 1.92 s per loop

In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
3 loops, best of 3: 564 ms per loop

In [9]: ls -lah x.npy x.blp
-rw-r--r-- 1 root root  49M Apr 18 12:53 x.blp
-rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy
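By the way, reading the data back is the single inverse call. A minimal round-trip sketch, assuming the pack_ndarray_file/unpack_ndarray_file pair used in the session (exact signatures may differ between Bloscpack versions):

    import numpy as np
    import bloscpack as bp

    x = np.linspace(1, 10, 50000000)
    bp.pack_ndarray_file(x, 'x.blp')     # compress and write
    y = bp.unpack_ndarray_file('x.blp')  # read and decompress
    assert (x == y).all()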
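Note that the bps.sync() calls above flush the OS write buffers, so the timings include the data actually reaching the disk rather than just the page cache. Outside IPython, the equivalent measurement could be sketched like this (assuming bloscpack.sysutil.sync wraps the system-level sync):

    import time
    import numpy as np
    import bloscpack as bp
    import bloscpack.sysutil as bps

    x = np.linspace(1, 10, 50000000)
    start = time.time()
    bp.pack_ndarray_file(x, 'x.blp')
    bps.sync()  # ensure buffered writes hit the disk before stopping the clock
    print("pack + sync: %.2f s" % (time.time() - start))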
However, this is a bit of a special case, since Blosc does extremely well -- both speed- and size-wise -- on the linspace data; your mileage may vary.
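To see the other extreme, one can feed Blosc essentially incompressible input. A hypothetical sketch (uniform random data instead of linspace; the file names are made up for illustration):

    import os
    import numpy as np
    import bloscpack as bp

    smooth = np.linspace(1, 10, 50000000)  # highly compressible
    noisy = np.random.random(50000000)     # nearly incompressible

    bp.pack_ndarray_file(smooth, 'smooth.blp')
    bp.pack_ndarray_file(noisy, 'noisy.blp')

    # Expect 'noisy.blp' to stay close to the raw ~400 MB, while
    # 'smooth.blp' comes out far smaller.
    for fname in ('smooth.blp', 'noisy.blp'):
        print(fname, os.path.getsize(fname))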
Exactly, and besides, Blosc can use different codecs inside it. Just for completeness, here is a small benchmark of what you can expect from them (my laptop does not have an SSD, so my figures are a bit slower than Valentin's):

In [50]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 5.7 s per loop

In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS

In [52]: cargs['cname'] = 'blosclz'

In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.12 s per loop

In [54]: cargs['cname'] = 'lz4'

In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 985 ms per loop

In [56]: cargs['cname'] = 'lz4hc'

In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.95 s per loop

In [58]: cargs['cname'] = 'snappy'

In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.11 s per loop

In [60]: cargs['cname'] = 'zlib'

In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 3.12 s per loop

So all the codecs make storage faster than a plain np.save(), most especially blosclz, lz4 and snappy. However, lz4hc and zlib achieve the best compression ratios:

In [62]: ls -lht x*.*
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp
-rw-r--r-- 1 faltet users  54M 18 abr 13:48 x-snappy.blp
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp
-rw-r--r-- 1 faltet users  48M 18 abr 13:47 x-lz4.blp
-rw-r--r-- 1 faltet users  49M 18 abr 13:47 x-blosclz.blp
-rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy

But again, we are talking about an especially compression-friendly case.

-- Francesc Alted
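As a footnote, the codec sweep above is easy to script. A minimal sketch, reusing the DEFAULT_BLOSC_ARGS dict from the session (and assuming x is the same linspace array):

    import numpy as np
    import bloscpack as bp

    x = np.linspace(1, 10, 50000000)

    # Run every codec benchmarked above, starting each run from the
    # default Blosc settings.
    for cname in ('blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib'):
        cargs = bp.args.DEFAULT_BLOSC_ARGS.copy()
        cargs['cname'] = cname
        bp.pack_ndarray_file(x, 'x-%s.blp' % cname, blosc_args=cargs)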