[Numpy-discussion] [ANN] bcolz v0.9.0

Valentin Haenel valentin at haenel.co
Sun May 17 12:15:09 EDT 2015

Announcing bcolz 0.9.0

What's new

This is mostly a smallish feature and bugfix release. One large topic
was implementing 'addcol' and 'delcol' to properly handle on-disk
tables. 'addcol' now has a new keyword argument 'move' that allows you
to specify whether you want to move or copy the data. 'delcol' has a new
keyword argument 'keep' which allows you to preserve the data on disk when
removing a column.  Additionally, ctable now supports an 'auto_flush'
keyword that makes it flush to disk automatically after any methods that
may write data.
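
The move/copy and keep semantics described above can be pictured with a
toy model in which every on-disk column is a directory. This is only a
sketch under that assumption, written with the Python standard library;
the ``ToyTable`` class, its method signatures and its directory layout are
hypothetical and are not bcolz's implementation.

```python
import os
import shutil
import tempfile


class ToyTable:
    """Hypothetical stand-in for an on-disk table: one directory per column."""

    def __init__(self, rootdir):
        self.rootdir = rootdir
        self.names = []
        os.makedirs(rootdir, exist_ok=True)

    def addcol(self, src_dir, name, move=False):
        """Attach the column stored in src_dir, moving or copying its data."""
        dest = os.path.join(self.rootdir, name)
        if move:
            shutil.move(src_dir, dest)      # the source directory disappears
        else:
            shutil.copytree(src_dir, dest)  # the source directory stays intact
        self.names.append(name)

    def delcol(self, name, keep=False):
        """Detach a column; with keep=True its data stays on disk."""
        self.names.remove(name)
        if not keep:
            shutil.rmtree(os.path.join(self.rootdir, name))


def demo():
    """Exercise copy, move, keep and delete on a throwaway directory."""
    base = tempfile.mkdtemp()
    try:
        src = os.path.join(base, "standalone_col")
        os.makedirs(src)
        with open(os.path.join(src, "data.bin"), "wb") as f:
            f.write(b"\x00" * 16)

        table = ToyTable(os.path.join(base, "table"))
        table.addcol(src, "a", move=False)
        src_after_copy = os.path.exists(src)
        table.addcol(src, "b", move=True)
        src_after_move = os.path.exists(src)

        table.delcol("a", keep=True)
        a_kept = os.path.exists(os.path.join(table.rootdir, "a"))
        table.delcol("b", keep=False)
        b_gone = not os.path.exists(os.path.join(table.rootdir, "b"))
        return src_after_copy, src_after_move, a_kept, b_gone, table.names
    finally:
        shutil.rmtree(base)
```

In this model a copy leaves the standalone column usable on its own, a
move does not, and a column deleted with ``keep=True`` is dropped from the
table while its directory survives.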

Another important aspect is handling the GIL. In this release, we do
keep the GIL while calling Blosc compress and decompress in order to
support lock-free operation of newer Blosc versions (1.5.x and beyond)
that no longer have a global state.

Furthermore, we now distribute 'carray_ext.pxd' as part of the
package via PyPI to ease building applications on bcolz, for example

Finally, the Sphinx based API documentation is now autogenerated from
the docstrings in the Python sources.

For the full list, please check the release notes at:


What it is

*bcolz* provides columnar and compressed data containers that can live
either on-disk or in-memory.  Column storage allows for efficiently
querying tables with a large number of columns.  It also allows for
cheap addition and removal of columns.  In addition, bcolz objects are
compressed by default for reducing memory/disk I/O needs. The
compression process is carried out internally by Blosc, an
extremely fast meta-compressor that is optimized for binary data. Lastly,
high-performance iterators (like ``iter()``, ``where()``) for querying
the objects are provided.
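
To make the idea of a compressed columnar container concrete, here is a
minimal sketch in pure Python. It is not how bcolz is implemented: zlib
stands in for Blosc, and the ``ToyColumn`` and ``ToyColumnStore`` names,
the chunk length and the ``where()`` signature are all assumptions made
for illustration.

```python
import zlib
from array import array


class ToyColumn:
    """Hypothetical column of 64-bit integers stored as compressed chunks."""

    def __init__(self, values, chunklen=1024):
        values = list(values)
        self.length = len(values)
        self.itemsize = array("q").itemsize
        self.chunks = []  # one compressed byte string per chunk
        for start in range(0, len(values), chunklen):
            raw = array("q", values[start:start + chunklen]).tobytes()
            self.chunks.append(zlib.compress(raw))

    def __iter__(self):
        # Decompress one chunk at a time, so only one chunk is expanded.
        for blob in self.chunks:
            chunk = array("q")
            chunk.frombytes(zlib.decompress(blob))
            yield from chunk

    def nbytes(self):
        """Size of the uncompressed data in bytes."""
        return self.length * self.itemsize

    def cbytes(self):
        """Size of the compressed data in bytes."""
        return sum(len(blob) for blob in self.chunks)


class ToyColumnStore:
    """Hypothetical table: a mapping from column name to ToyColumn."""

    def __init__(self):
        self.cols = {}

    def addcol(self, name, values):
        self.cols[name] = ToyColumn(values)  # other columns are untouched

    def delcol(self, name):
        del self.cols[name]                  # other columns are untouched

    def where(self, name, predicate):
        """Yield row indices whose value in one column satisfies predicate."""
        return (i for i, v in enumerate(self.cols[name]) if predicate(v))
```

Because each column is stored separately, adding or removing one never
rewrites the others, and a query on a single column only decompresses
that column's chunks.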

bcolz can use numexpr internally to accelerate many vector and
query operations (although it can also use pure NumPy for this).
numexpr optimizes memory usage and uses several cores for the
computations, so it is blazing fast.  Moreover, since the carray/ctable
containers can be disk-based, it is possible to use them for
seamlessly performing out-of-memory computations.

bcolz has minimal dependencies (NumPy), comes with an exhaustive test
suite and fully supports both 32-bit and 64-bit platforms.  Also, it is
typically tested on both UNIX and Windows operating systems.

Together, bcolz and the Blosc compressor are finally fulfilling the
promise of accelerating memory I/O, at least for some real scenarios:


Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/), the
Blaze project (http://blaze.pydata.org/), Quantopian
(https://www.quantopian.com/) and Scikit-Allel
(https://github.com/cggh/scikit-allel), which you can read more about by
pointing your browser at the links below.

* Visualfabriq:

  * *bquery*, a query and aggregation framework for bcolz:
  * https://github.com/visualfabriq/bquery

* Blaze:

  * Notebooks showing Blaze + Pandas + bcolz interaction:
  * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb
  * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb

* Quantopian:

  * Using compressed data containers for faster backtesting at scale:
  * https://quantopian.github.io/talks/NeedForSpeed/slides.html

* Scikit-Allel:

  * Provides an alternative backend to work with compressed arrays
  * https://scikit-allel.readthedocs.org/en/latest/bcolz.html


bcolz is in the PyPI repository, so installing it is easy::

    $ pip install -U bcolz


Visit the main bcolz site repository at:


Home of Blosc compressor:

User's mail list:
bcolz at googlegroups.com

License is the new BSD:

Release notes can be found in the Git repository:


  **Enjoy data!**
