ANN: PyTables 0.9 released

Fri Nov 5 21:26:09 CET 2004

Announcing PyTables 0.9
-----------------------

I'm proud to announce the availability of the newest and most powerful
incarnation of PyTables ever <wink>. In this release you will find a
series of exciting new features, the most important being the indexing
capabilities, in-kernel selections, support for complex datatypes and
the possibility to modify values in both tables *and* arrays.

What it is
----------

PyTables is a hierarchical database package designed to efficiently
manage extremely large amounts of data (it supports full 64-bit file
addressing). It features an object-oriented interface that, combined
with C extensions for the performance-critical parts of the code,
makes it a very easy-to-use tool for high-performance data saving and
retrieval.

It is built on top of the HDF5 library and the numarray package, and
provides containers for both heterogeneous data (Tables) and
homogeneous data (Array, EArray). It also sports a container for
keeping lists of objects of variable length in a very efficient way
(VLArray). Flexible support for filters allows you to compress your
data on the fly by using different compressors and compression
enablers.
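
As a rough illustration of compressing data on the fly, here is a
minimal sketch using only the Python standard library. It is not the
PyTables filter API: the helpers shuffle() and unshuffle() are
hypothetical, and treating a byte shuffle as an example of a
"compression enabler" is an assumption made for this sketch.

```python
import struct
import zlib

def shuffle(data, itemsize):
    """Group the k-th byte of every item together (byte transposition)."""
    return b"".join(data[k::itemsize] for k in range(itemsize))

def unshuffle(data, itemsize):
    """Undo shuffle(); assumes len(data) is a multiple of itemsize."""
    n = len(data) // itemsize
    return bytes(data[k * n + i] for i in range(n) for k in range(itemsize))

# 1000 little-endian 32-bit integers, 4 bytes each.
values = list(range(1000, 2000))
raw = struct.pack("<1000i", *values)

# Write path: shuffle, then compress. Read path: the reverse.
packed = zlib.compress(shuffle(raw, 4))
restored = unshuffle(zlib.decompress(packed), 4)
```

The caller only ever sees raw and restored; the compressed form stays
an internal detail of the storage layer.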

Moreover, its powerful browsing and searching capabilities allow you
to do data selections over tables exceeding gigabytes of data in just
tenths of a second.

Changes in more depth
---------------------

New features:

- Indexing of columns in tables. This allows data selections on
  tables to run up to 500 times faster than standard selections (e.g.
  a selection along an indexed column of 100 million rows takes less
  than 1 second on a modern CPU).

  Perhaps the most interesting thing about the indexing algorithm
  implemented by PyTables is that the time taken to index grows
  *linearly* with the length of the data, which makes the indexing
  process *scalable* (quite unlike many relational databases). This
  means that it can index arbitrarily large table columns relatively
  quickly (e.g. indexing a column of 100 million rows takes just 100
  seconds, i.e. a rate of 1 Mrow/sec). For more detailed information,
  see http://pytables.sourceforge.net/doc/SciPy04.pdf.

- In-kernel selections. This feature allows data selections on tables
  to run up to 5 times faster than standard selections (i.e. pre-0.9
  selections), without the need to create an index. As a hint of how
  fast these selections can be, they are up to 10 times faster than a
  traditional relational database. Again, see
  http://pytables.sourceforge.net/doc/SciPy04.pdf for some experiments
  on that matter.

- Support for complex datatypes in all the data objects (i.e. Table,
  Array, EArray and VLArray). With that, the complete set of datatypes
  of the Numeric and numarray packages is supported. Thanks to Tom
  Hedley for providing the patches for Array, EArray and VLArray
  objects, as well as updating the User's Manual and adding unit tests
  for the new functionality.

- Modification of values. You can modify Table, Array, EArray and
  VLArray values. See Table.modifyRows(), Table.modifyColumns() and
  the newly introduced __setitem__() method for Table, Array, EArray
  and VLArray entities in the Library Reference of the User's Manual.

- A new sub-package called "nodes". It will hold different modules
  that make it easier to work with different entities (like images,
  files, ...). The first module added to this sub-package is
  "FileNode", whose mission is to enable the creation of a database of
  nodes which can be used like regular open files in Python. In other
  words, you can store a set of files in a PyTables database, and read
  and write them as you would any other file in Python. Thanks to Ivan
  Vilata i Balaguer for contributing this.
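
To give a feel for why selecting through an index beats scanning every
row, here is a minimal pure-Python sketch. It is not the PyTables
indexing algorithm or API: build_index(), select_scan() and
select_indexed() are hypothetical names, and the sort used to build
this toy index is O(n log n), so it does not reproduce the linear
indexing time mentioned above.

```python
import bisect

def build_index(column):
    """Return (sorted_values, row_numbers) for a column (hypothetical helper)."""
    order = sorted(range(len(column)), key=column.__getitem__)
    return [column[i] for i in order], order

def select_scan(column, lo, hi):
    """Full scan: test every row, O(n) per query."""
    return [i for i, v in enumerate(column) if lo <= v < hi]

def select_indexed(index, lo, hi):
    """Indexed selection: two binary searches, then slice out the hits."""
    values, rows = index
    start = bisect.bisect_left(values, lo)
    stop = bisect.bisect_left(values, hi)
    return sorted(rows[start:stop])

# A toy column of 10000 rows with values spread over 0..999.
column = [(i * 7919) % 1000 for i in range(10000)]
index = build_index(column)
hits = select_indexed(index, 10, 12)
```

Both selection functions return the same row numbers; the indexed one
only touches the rows that match, which is where the speed-up comes
from once the index has been built.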

Improvements:

- New __len__(self) methods added to Arrays, Tables and Columns. This,
  in combination with __getitem__(self, key), allows these objects to
  better emulate sequences.

- Better capabilities to import generic HDF5 files. In particular,
  Table objects (in the HDF5_HL naming schema) with "holes" in their
  compound type definition are supported. This makes it possible to
  read certain files produced by NASA (thanks to Stephen Walton for
  reporting this).

- Much improved unit tests. More than 2000 different tests have been
  implemented, accounting for more than 13000 lines of code (twice the
  size of the PyTables library code itself (!)).
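
As a reminder of what __len__() and __getitem__() buy you in Python,
here is a minimal sketch. ToyColumn is a hypothetical stand-in written
for this example, not a PyTables class.

```python
class ToyColumn:
    """Hypothetical container that emulates a sequence; not a PyTables class."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        # Accept both integers and slices, as a list does.
        return self._values[key]

col = ToyColumn([10, 20, 30])
last = col[-1]                      # negative indexing
first_two = col[0:2]                # slicing
backwards = list(reversed(col))     # reversed() needs both methods
```

With just these two methods, len(), indexing, slicing, iteration and
reversed() all work on the object.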

Backward-incompatible API changes:

- The __call__ special method has been removed from the File, Group,
  Table, Array, EArray and VLArray objects. You must now use
  walkNodes() in File and Group, and iterrows() in Table, Array,
  EArray and VLArray, to achieve the same functionality. This will
  also provide better compatibility with IPython.

'nctoh5', a new importing utility:

- Jeff Whitaker has contributed a script to easily convert NetCDF
  files into HDF5 files using Scientific Python and PyTables. It has
  been included and documented as a new utility.

Bug fixes:

- A call to File.flush() now invokes H5Fflush(), so that all the file
  contents are effectively flushed to disk. Thanks to Shack Toms for
  reporting this and providing a patch.

- SF #1054683: Security hole in utils.checkNameValidity(). Reported on
  2004-10-26 by ivilata

- SF #1049297: Suggestion: new method File.delAttrNode(). Reported on
  2004-10-18 by ivilata

- SF #1049285: Leak in AttributeSet.__delattr__(). Reported on
  2004-10-18 by ivilata

- SF #1014298: Wrong method call in examples/tutorial1-2.py. Reported
  on 2004-08-23 by ivilata

- SF #1013202: Cryptic error appending to EArray on RO file. Reported
  on 2004-08-21 by ivilata

- SF #991715: Table.read(field="var1", flavor="List") fails. Reported
  on 2004-07-15 by falted

- SF #988547: Wrong file type assumption in File.__new__. Reported on
  2004-07-10 by ivilata

Where can PyTables be applied?
------------------------------

PyTables is not designed to work as a relational database competitor,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your cluttered
RDBMS, then give PyTables a try. It works well for storing data from data
acquisition systems (DAS), simulation software, network data monitoring
systems (for example, traffic measurements of IP packets on routers),
very large XML files, or for creating a centralized repository for system 
logs, to name only a few possible uses.

What is a table?
----------------

A table is defined as a collection of records whose values are stored in
fixed-length fields. All records have the same structure and all values
in each field have the same data type.  The terms "fixed-length" and
"strict data types" seem to be quite a strange requirement for a
language like Python that supports dynamic data types, but they serve a
useful function if the goal is to save very large quantities of data
(such as is generated by many scientific applications, for example) in
an efficient manner that reduces demand on CPU time and I/O resources.
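
To see why fixed-length fields and strict types pay off, here is a
minimal sketch using the standard struct module. The record layout
and the helpers pack_row() and read_row() are hypothetical, chosen
for this example; this is not how PyTables itself stores tables.

```python
import struct

# Hypothetical record layout: a 16-byte name, a 32-bit integer and a
# 64-bit float, little-endian with no padding (28 bytes per record).
RECORD = struct.Struct("<16sid")

def pack_row(name, count, energy):
    """Serialize one record; short names are padded with null bytes."""
    return RECORD.pack(name.encode("ascii"), count, energy)

def read_row(buffer, n):
    """Read row n directly: its offset is simply n * RECORD.size."""
    name, count, energy = RECORD.unpack_from(buffer, n * RECORD.size)
    return name.rstrip(b"\0").decode("ascii"), count, energy

# Build a 100-row "table" as one contiguous block of bytes.
table = b"".join(pack_row("p%d" % i, i, i * 0.5) for i in range(100))
row = read_row(table, 42)
```

Because every record has the same size, any row can be located with a
single multiplication, without parsing the rows that come before it.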

What is HDF5?
-------------

For those people who know nothing about HDF5, it is a general purpose
library and file format for storing scientific data, developed at NCSA. HDF5
can store two primary objects: datasets and groups. A dataset is
essentially a multidimensional array of data elements, and a group is a
structure for organizing objects in an HDF5 file. Using these two basic
constructs, one can create and store almost any kind of scientific data
structure, such as images, arrays of vectors, and structured and
unstructured grids. You can also mix and match them in HDF5 files
according to your needs.
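
The group/dataset idea can be pictured with plain Python containers.
This is only a conceptual sketch and does not read or write HDF5:
dicts stand in for groups, nested lists stand in for datasets, and
get_node() and the node names are hypothetical.

```python
def get_node(root, path):
    """Follow a slash-separated path from the root group (hypothetical helper)."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # skip the empty component produced by the path "/"
            node = node[name]
    return node

# Groups are dicts; datasets are multidimensional arrays (nested lists).
root = {
    "detector": {
        "readout": [[1, 2], [3, 4]],
        "images": {
            "frame0": [[0, 0], [0, 1]],
        },
    },
}

readout = get_node(root, "/detector/readout")
frame = get_node(root, "/detector/images/frame0")
```

Groups can hold both datasets and other groups, so the file behaves
much like a small filesystem.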

Platforms
---------

I'm using Linux (Intel 32-bit) as the main development platform, but
PyTables should be easy to compile/install on many other UNIX
machines. This package has also passed all the tests on an UltraSparc
platform with Solaris 7 and Solaris 8. It also compiles and passes all
the tests on an SGI Origin2000 with MIPS R12000 processors, with the
MIPSPro compiler and running IRIX 6.5. It also runs fine on Linux
64-bit platforms, like AMD Opteron running GNU/Linux 2.4.21 Server,
Intel Itanium (IA64) running GNU/Linux 2.4.21 or PowerPC G5 with Linux
2.6.x in 64-bit mode. It has also been tested on MacOSX platforms
(10.2, but it should also work on newer versions).

Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP (using the Microsoft Visual C compiler), but it
should work with other flavors as well.

Web site
--------

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/

To learn more about the company behind the PyTables development, see:

http://www.carabos.com/

Share your experience
---------------------

Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.

Bon profit!

-- Francesc Altet