ANN: python-blosc 1.2.0 released
=============================
Announcing python-blosc 1.2.0
=============================

What is new?
============

This release adds support for the multiple compressors added in the
Blosc 1.3 series. The new compressors are:

* lz4 (http://code.google.com/p/lz4/): A very fast
  compressor/decompressor. It can be thought of as a replacement for
  the original BloscLZ, but it can behave better in some scenarios.

* lz4hc (http://code.google.com/p/lz4/): A variation of LZ4 that
  achieves much better compression ratios at the cost of much slower
  compression. Decompression speed is unaffected (and sometimes even
  better than with LZ4 itself!), so it is very good for read-only
  datasets.

* snappy (http://code.google.com/p/snappy/): A very fast
  compressor/decompressor. It can be thought of as a replacement for
  the original BloscLZ, but it can behave better in some scenarios.

* zlib (http://www.zlib.net/): A classic. It achieves very good
  compression ratios at the cost of speed. However, decompression
  speed is still pretty good, so it is a good candidate for read-only
  datasets.

Selecting the compressor is just a matter of specifying the new
`cname` parameter in the compression functions. For example::

    import numpy
    import blosc

    N = 10**7
    arr = numpy.arange(N, dtype=numpy.int64)
    packed = blosc.pack_array(arr, cname="lz4")

(A fuller round-trip example appears at the end of this section.)

To get an overview of the differences between the compressors in the
new Blosc, here is the output of the included compress_ptr.py
benchmark:

https://github.com/ContinuumIO/python-blosc/blob/master/bench/compress_ptr.py

which compresses/decompresses NumPy arrays with different data
distributions::

    Creating different NumPy arrays with 10**7 int64/float64 elements:

    *** np.copy() ****  Time for memcpy():   0.030 s

    *** the arange linear distribution ***
    *** blosclz *** Time for comp/decomp: 0.013/0.022 s.  Compr ratio: 136.83
    *** lz4     *** Time for comp/decomp: 0.009/0.031 s.  Compr ratio: 137.19
    *** lz4hc   *** Time for comp/decomp: 0.103/0.021 s.  Compr ratio: 165.12
    *** snappy  *** Time for comp/decomp: 0.012/0.045 s.  Compr ratio:  20.38
    *** zlib    *** Time for comp/decomp: 0.243/0.056 s.  Compr ratio: 407.60

    *** the linspace linear distribution ***
    *** blosclz *** Time for comp/decomp: 0.031/0.036 s.  Compr ratio:  14.27
    *** lz4     *** Time for comp/decomp: 0.016/0.033 s.  Compr ratio:  19.68
    *** lz4hc   *** Time for comp/decomp: 0.188/0.020 s.  Compr ratio:  78.21
    *** snappy  *** Time for comp/decomp: 0.020/0.032 s.  Compr ratio:  11.72
    *** zlib    *** Time for comp/decomp: 0.290/0.048 s.  Compr ratio:  90.90

    *** the random distribution ***
    *** blosclz *** Time for comp/decomp: 0.083/0.025 s.  Compr ratio:   4.35
    *** lz4     *** Time for comp/decomp: 0.022/0.034 s.  Compr ratio:   4.65
    *** lz4hc   *** Time for comp/decomp: 1.803/0.039 s.  Compr ratio:   5.61
    *** snappy  *** Time for comp/decomp: 0.028/0.023 s.  Compr ratio:   4.48
    *** zlib    *** Time for comp/decomp: 3.146/0.073 s.  Compr ratio:   6.17

That means that Blosc in combination with LZ4 can compress at speeds
that are up to 3x faster than a pure memcpy() operation.
Decompression is a bit slower (but still on the same order as
memcpy()), probably because writing to memory is slower than reading.
These figures were obtained on an Intel Core i5-3380M CPU @ 2.90GHz,
running Python 3.3 and Linux 3.7.10, but YMMV (and it will vary!).
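If you want a quick feel for how the new compressors behave on your
own data without running the full benchmark, a round-trip over all of
them is easy to write. Here is a minimal sketch (the array size
mirrors the benchmark above, and the compressor list is hard-coded
for clarity)::

    import numpy
    import blosc

    arr = numpy.arange(10**7, dtype=numpy.int64)
    nbytes = arr.size * arr.dtype.itemsize
    for cname in ("blosclz", "lz4", "lz4hc", "snappy", "zlib"):
        packed = blosc.pack_array(arr, cname=cname)
        # pack_array returns a bytes object, so its length gives the ratio
        ratio = float(nbytes) / len(packed)
        print("%-8s compression ratio: %6.2f" % (cname, ratio))
        # unpack_array restores an identical copy of the original array
        assert (blosc.unpack_array(packed) == arr).all()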
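The benchmark itself uses the pointer-based functions compress_ptr()
and decompress_ptr(), which operate directly on the array's memory
buffer instead of going through an intermediate bytes object. A
minimal sketch of that usage, obtaining the pointer via NumPy's array
interface in the same way the benchmark script does::

    import numpy
    import blosc

    arr = numpy.arange(10**7, dtype=numpy.int64)
    address = arr.__array_interface__['data'][0]
    # compress straight from the array's buffer (no intermediate copy)
    packed = blosc.compress_ptr(address, arr.size, arr.dtype.itemsize,
                                clevel=9, shuffle=True, cname="lz4")
    # decompress into a preallocated array of the same shape and dtype
    out = numpy.empty_like(arr)
    blosc.decompress_ptr(packed, out.__array_interface__['data'][0])
    assert (arr == out).all()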
For more info, you can have a look at the release notes in:

https://github.com/ContinuumIO/python-blosc/wiki/Release-notes

More docs and examples are available in the documentation site:

http://blosc.pydata.org

What is it?
===========

python-blosc (http://blosc.pydata.org/) is a Python wrapper for the
Blosc compression library.

Blosc (http://blosc.org) is a high performance compressor optimized
for binary data. It has been designed to transmit data to the
processor cache faster than the traditional, non-compressed, direct
memory fetch approach via a memcpy() call. Whether this is achieved
or not depends on the compressibility of the data, the number of
cores in the system, and other factors. See a series of benchmarks
conducted for many different systems:

http://blosc.org/trac/wiki/SyntheticBenchmarks

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids
with regularly-spaced values, etc.

There is also a handy command-line tool for Blosc called Bloscpack
(https://github.com/esc/bloscpack) that allows you to compress large
binary data files on-disk. Although the Bloscpack format has not
stabilized yet, it lets you use Blosc effectively from your favorite
shell.

Installing
==========

python-blosc is in the PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix

Download sources
================

The sources are managed through github services at:

http://github.com/ContinuumIO/python-blosc

Documentation
=============

There is a Sphinx-based documentation site at:

http://blosc.pydata.org/

Mailing list
============

There is an official mailing list for Blosc at:

blosc@googlegroups.com
http://groups.google.es/group/blosc

Licenses
========

Both Blosc and its Python wrapper are distributed under the MIT
license. See:

https://github.com/ContinuumIO/python-blosc/blob/master/LICENSES

for more details.

--
Francesc Alted
Continuum Analytics, Inc.