[Numpy-discussion] ANN: python-blosc 1.2.0 released

Francesc Alted francesc at continuum.io
Sun Jan 26 02:44:25 EST 2014


=============================
Announcing python-blosc 1.2.0
=============================

What is new?
============

This release adds support for the multiple compressors introduced in
the Blosc 1.3 series.  The new compressors are:

* lz4 (http://code.google.com/p/lz4/): A very fast
   compressor/decompressor.  It can be thought of as a replacement for
   the original BloscLZ, but it can behave better in some scenarios.

* lz4hc (http://code.google.com/p/lz4/): This is a variation of LZ4
   that achieves a much better compression ratio at the cost of much
   slower compression.  Decompression speed is unaffected (and
   sometimes better than when using LZ4 itself!), so this is very good
   for read-only datasets.

* snappy (http://code.google.com/p/snappy/): A very fast
   compressor/decompressor.  It can be thought of as a replacement for
   the original BloscLZ, but it can behave better in some scenarios.

* zlib (http://www.zlib.net/): This is a classic.  It achieves very
   good compression ratios, at the cost of speed.  However,
   decompression speed is still pretty good, so it is a good candidate
   for read-only datasets.

Selecting the compressor is just a matter of specifying the new `cname`
parameter in compression functions.  For example::

   arr = numpy.arange(N, dtype=numpy.int64)
   out = blosc.pack_array(arr, cname="lz4")
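
To check that nothing is lost in the round trip, here is a minimal
sketch (assuming a recent python-blosc: blosc.unpack_array() restores
the array and blosc.compressor_list() reports the compressors that are
available in your build)::

   import numpy
   import blosc

   arr = numpy.arange(10**6, dtype=numpy.int64)
   packed = blosc.pack_array(arr, cname="lz4")  # compress with LZ4
   restored = blosc.unpack_array(packed)        # back to a NumPy array
   assert numpy.array_equal(arr, restored)
   print(blosc.compressor_list())               # compressors in this build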

To give an overview of the differences between the compressors in the
new Blosc, here is the output of the included compress_ptr.py
benchmark:

https://github.com/ContinuumIO/python-blosc/blob/master/bench/compress_ptr.py

that compresses/decompresses NumPy arrays with different data
distributions::

   Creating different NumPy arrays with 10**7 int64/float64 elements:
     *** np.copy() **** Time for memcpy():     0.030 s

   *** the arange linear distribution ***
     *** blosclz  *** Time for comp/decomp: 0.013/0.022 s. Compr ratio: 136.83
     *** lz4      *** Time for comp/decomp: 0.009/0.031 s. Compr ratio: 137.19
     *** lz4hc    *** Time for comp/decomp: 0.103/0.021 s. Compr ratio: 165.12
     *** snappy   *** Time for comp/decomp: 0.012/0.045 s. Compr ratio:  20.38
     *** zlib     *** Time for comp/decomp: 0.243/0.056 s. Compr ratio: 407.60

   *** the linspace linear distribution ***
     *** blosclz  *** Time for comp/decomp: 0.031/0.036 s. Compr ratio:  14.27
     *** lz4      *** Time for comp/decomp: 0.016/0.033 s. Compr ratio:  19.68
     *** lz4hc    *** Time for comp/decomp: 0.188/0.020 s. Compr ratio:  78.21
     *** snappy   *** Time for comp/decomp: 0.020/0.032 s. Compr ratio:  11.72
     *** zlib     *** Time for comp/decomp: 0.290/0.048 s. Compr ratio:  90.90

   *** the random distribution ***
     *** blosclz  *** Time for comp/decomp: 0.083/0.025 s. Compr ratio:   4.35
     *** lz4      *** Time for comp/decomp: 0.022/0.034 s. Compr ratio:   4.65
     *** lz4hc    *** Time for comp/decomp: 1.803/0.039 s. Compr ratio:   5.61
     *** snappy   *** Time for comp/decomp: 0.028/0.023 s. Compr ratio:   4.48
     *** zlib     *** Time for comp/decomp: 3.146/0.073 s. Compr ratio:   6.17

That means that Blosc in combination with LZ4 can compress at speeds
up to 3x faster than a pure memcpy() operation (0.009 s vs. 0.030 s on
the arange data above).  Decompression is a bit slower (but still on
the same order as memcpy()), probably because writing to memory is
slower than reading from it.  These figures were obtained on an Intel
Core i5-3380M CPU @ 2.90GHz, running Python 3.3 and Linux 3.7.10, but
YMMV (and will vary!).
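
If you want to reproduce a rough version of these numbers on your own
machine, the sketch below times each compressor through the
pack_array()/unpack_array() calls.  Note that this is only an
approximation of compress_ptr.py, which compresses from raw pointers,
so the absolute figures will differ::

   import time
   import numpy
   import blosc

   arr = numpy.arange(10**7, dtype=numpy.int64)  # same size as the arange case above
   nbytes = arr.size * arr.itemsize
   for cname in ("blosclz", "lz4", "lz4hc", "snappy", "zlib"):
       t0 = time.time()
       packed = blosc.pack_array(arr, cname=cname)
       t1 = time.time()
       blosc.unpack_array(packed)
       t2 = time.time()
       print("%-8s comp: %.3f s  decomp: %.3f s  ratio: %6.2f"
             % (cname, t1 - t0, t2 - t1, nbytes / float(len(packed))))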

For more info, you can have a look at the release notes at:

https://github.com/ContinuumIO/python-blosc/wiki/Release-notes

More docs and examples are available on the documentation site:

http://blosc.pydata.org


What is it?
===========

python-blosc (http://blosc.pydata.org/) is a Python wrapper for the
Blosc compression library.

Blosc (http://blosc.org) is a high performance compressor optimized for
binary data.  It has been designed to transmit data to the processor
cache faster than the traditional, non-compressed, direct memory fetch
approach via a memcpy() call.  Whether this is achieved or not depends
on the data compressibility, the number of cores in the system, and
other factors.  See a series of benchmarks conducted for many
different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks.

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids with
regularly-spaced values, etc.
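
For data that is not already a NumPy array, python-blosc also exposes
the lower-level blosc.compress()/blosc.decompress() pair, which works
on raw bytes and takes the element size as the typesize argument so
that the shuffle filter can operate on it.  A minimal round-trip
sketch (the parameter values here are only illustrative)::

   import numpy
   import blosc

   arr = numpy.arange(10**6, dtype=numpy.int64)  # regular pattern: low entropy
   compressed = blosc.compress(arr.tobytes(), typesize=8, cname="blosclz")
   restored = numpy.frombuffer(blosc.decompress(compressed),
                               dtype=numpy.int64)
   assert numpy.array_equal(arr, restored)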

There is also a handy command-line tool for Blosc called Bloscpack
(https://github.com/esc/bloscpack) that allows you to compress large
binary datafiles on-disk.  Although the Bloscpack format has not
stabilized yet, it lets you effectively use Blosc from your favorite
shell.


Installing
==========

python-blosc is in the PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix


Download sources
================

The sources are managed through GitHub at:

http://github.com/ContinuumIO/python-blosc


Documentation
=============

There is a Sphinx-based documentation site at:

http://blosc.pydata.org/


Mailing list
============

There is an official mailing list for Blosc at:

blosc at googlegroups.com
http://groups.google.es/group/blosc


Licenses
========

Both Blosc and its Python wrapper are distributed using the MIT license.
See:

https://github.com/ContinuumIO/python-blosc/blob/master/LICENSES

for more details.

--
Francesc Alted
Continuum Analytics, Inc.

