[SciPy-user] PyTables 1.0 released

Francesc Altet falted at pytables.org
Tue May 10 07:27:46 EDT 2005


=========================
 Announcing PyTables 1.0
=========================

The Carabos crew is very proud to announce the immediate availability
of **PyTables release 1.0**.  In this release you will find a series
of exciting new features, the most important being the Undo/Redo
capabilities, support for objects (and indexes!) with more than 2**31
rows, better I/O performance for Numeric objects, new time datatypes
(useful for time-stamping fields), support for Octave HDF5 files and
improved support for HDF5 native files.


What it is
==========

**PyTables** is a package for managing hierarchical datasets,
designed to cope efficiently with extremely large amounts of data
(with support for full 64-bit file addressing).  It features an
object-oriented interface that, combined with C extensions for the
performance-critical parts of the code, makes it a very easy-to-use
tool for high performance data storage and retrieval.

It is built on top of the HDF5 library and the numarray package, and
provides containers for both heterogeneous data (``Table``) and
homogeneous data (``Array``, ``EArray``) as well as containers for
keeping lists of objects of variable length (like Unicode strings or
general Python objects) in a very efficient way (``VLArray``).  It
also sports a series of filters allowing you to compress your data
on-the-fly by using different compressors and compression enablers.

But perhaps its most interesting features are its powerful browsing
and searching capabilities that allow you to perform data selections
over heterogeneous datasets exceeding gigabytes of data in just tenths
of a second.  Besides, the PyTables I/O is buffered, implemented in C
and carefully tuned so that you can reach much better performance with
PyTables than with your own home-grown wrappings to the HDF5 library.


Changes in more depth
=====================

Improvements:

- New Undo/Redo feature (i.e. integrated support for undoing and/or
  redoing actions).  This functionality lets you put marks in
  specific places of your data operations, so that you can make your
  HDF5 file pop back (undo) to a specific mark (for example for
  inspecting how your data looked at that point).  You can also go
  forward to a more recent marker (redo).  You can even do jumps to
  the marker you want using just one instruction.

- Reading Numeric objects from ``*Array`` and ``Table`` (Numeric
  columns) objects is now 50-100x faster.  With that, Louis Wicker
  reported that a speed of 350 MB/s can be achieved with Numeric
  objects (on an SGI Altix with a RAID 5 disk array) while with
  numarrays, this speed approaches 900 MB/s.  This improvement has
  been possible mainly due to a nice recipe from Todd Miller.  Thanks
  Todd!

- With PyTables 1.0 you can finally create Tables, EArrays and
  VLArrays with more than 2**31 (~ 2 thousand million) rows, as well
  as retrieve them. Before PyTables 1.0, retrieving data on these
  beasts was not well supported, in part due to limitations in some
  slicing functions in Python (that rely on 32-bit addressing). So, we
  have made the necessary modifications in these functions to support
  64-bit indexes and integrated them into PyTables.  As a result, our
  tests show that this feature works just fine.

- As a consequence of the above, you can now index columns of tables
  with more than 2**31 rows.  For instance, indexes have been created
  for integer columns with 10**10 (yeah, 10 thousand million) rows in
  less than 1 hour using an Opteron @ 1.6 GHz system (~ 1 hour and a
  half with a Xeon Intel32 @ 2.5 GHz platform).  Enjoy!

- Now PyTables supports the native HDF5 time types, both 32-bit signed
  integer and 64-bit fixed point timestamps.  They are mapped to
  ``Int32`` and ``Float64`` values for easy manipulation.  See the
  documentation for the ``Time32Col`` and ``Time64Col`` classes.

- Massive internal reorganization of the methods that deal with the
  hierarchy. Hopefully, that will enable a better understanding of the
  code for anybody wanting to add/modify features.

- The opening and copying of files with a large number of objects has
  been made faster by correcting a typo in ``Table._open()``.  Thanks
  to Ashley Walsh for sending a patch for this.

- Now, one can modify rank-0 (scalar) ``EArray`` datasets.  Thanks to
  Norbert Nemec for providing a patch for this.

- From this version on, you are allowed to use names that are not
  valid for natural naming as node or attribute names.  A warning is
  issued (but the name is not forbidden) in such a case.  Of course,
  you have to use the ``getattr()`` function to access such names.

- The indexes of ``Table`` and ``*Array`` datasets can be of long type
  as well as of integer type.  However, indexes in slices are still
  restricted to regular integer type.

- The concept of ``READ_ONLY`` system attributes has disappeared.  You
  can change them now at your own risk!  However, you still cannot
  remove or rename system attributes.

- Now, one can do reads in-between write loops using ``table.row``
  instances.  This is thanks to a decoupling in I/O buffering: there
  is now one buffer for reading and another for writing, so that
  collisions no longer take place.  Fixes #1186892.

- Support for Octave HDF5 output format.  Even complex arrays are
  supported.  Thanks to Edward C. Jones for reporting this format.
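
The mark-based Undo/Redo behaviour described in the first item above
can be pictured with a small, self-contained model.  This is only a
conceptual sketch in plain Python: the ``MarkedLog`` class and its
method names are hypothetical and are not the PyTables API, and the
real feature operates on the HDF5 file rather than on a Python list.

```python
class MarkedLog:
    """Toy action log with named marks (illustrative, not the PyTables API)."""

    def __init__(self):
        self.actions = []   # every recorded action, in order
        self.marks = {}     # mark name -> position in self.actions
        self.position = 0   # how many actions are currently applied

    def record(self, action):
        # Recording after an undo discards the undone tail and its marks.
        del self.actions[self.position:]
        self.marks = {n: p for n, p in self.marks.items() if p <= self.position}
        self.actions.append(action)
        self.position = len(self.actions)

    def mark(self, name):
        # Remember the current position under a name.
        self.marks[name] = self.position

    def goto(self, name):
        # One instruction jumps to a mark, whether it lies behind the
        # current position (undo) or ahead of it (redo).
        self.position = self.marks[name]

    def state(self):
        # The actions that are in effect at the current position.
        return self.actions[:self.position]


log = MarkedLog()
log.record("create /table")
log.mark("empty")
log.record("append 100 rows")
log.mark("filled")
log.record("delete 10 rows")

log.goto("empty")             # undo back to the first mark
state_at_empty = log.state()
log.goto("filled")            # redo forward to a more recent mark
state_at_filled = log.state()
```

In this model an undo never destroys anything: later actions stay in
the log until a new action is recorded, which is what makes the redo
step possible.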

Backward-incompatible changes:

- The format of indexes has been changed and indexes in files created
  with PyTables pre-1.0 versions are ignored now.  However,
  ``ptrepack`` can still save your life because it is able to convert
  your old files into the new indexing format.  Also, if you copy the
  affected tables to other locations (by using ``Leaf.copy()``), it
  will regenerate your indexes with the new format for you.

- The API has changed a little bit (nothing serious) for some methods.
  See ``RELEASE-NOTES.txt`` for more details.

Bug fixes:

- Added partial support for native HDF5 chunked datasets.  They can be
  read now, and even extended, but only along the first extensible
  dimension.  This limitation may be removed when multiple extensible
  dimensions are supported in PyTables.

- Formerly, when the name of a column in a table was subsumed in
  another column name, PyTables crashed while retrieving information
  of the former column.  That has been fixed.

- A bug prevented the use of indexed columns of tables that were at a
  hierarchical level other than root. This is solved now.

- When a ``Group`` was renamed you were not able to modify its
  attributes.  This has been fixed.

- When either ``Table.modifyColumns()`` or ``Table.modifyRows()``
  was called, a subsequent call to ``Table.flush()`` didn't really
  flush the modified data to disk.  This works as intended now.

- Fixed some issues when iterating over ``*Array`` objects using the
  ``List`` or ``Tuple`` flavor.


Important note for Python 2.4 and Windows users
===============================================

If you want to use PyTables with Python 2.4 on Windows platforms,
you will need to get the HDF5 library compiled for MSVC 7.1, aka
.NET 2003.  It can be found at:
ftp://ftp.ncsa.uiuc.edu/HDF/HDF5/current/bin/windows/5-164-win-net.ZIP

Users of Python 2.3 on Windows will have to download the version of
HDF5 compiled with MSVC 6.0 available at:
ftp://ftp.ncsa.uiuc.edu/HDF/HDF5/current/bin/windows/5-164-win.ZIP


Where can PyTables be applied?
==============================

PyTables is not designed to work as a relational database competitor,
but rather as a teammate.  If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your
cluttered RDBMS, then give PyTables a try.  It works well for storing
data from data acquisition systems (DAS), simulation software, network
data monitoring systems (for example, traffic measurements of IP
packets on routers), very large XML files, or for creating a
centralized repository for system logs, to name only a few possible
uses.


What is a table?
================

A table is defined as a collection of records whose values are stored
in fixed-length fields.  All records have the same structure and all
values in each field have the same data type.  The terms
"fixed-length" and "strict data types" seem to be quite a strange
requirement for a language like Python that supports dynamic data
types, but they serve a useful function if the goal is to save very
large quantities of data (such as is generated by many scientific
applications, for example) in an efficient manner that reduces demand
on CPU time and I/O resources.
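
As a rough illustration of why fixed-length fields help, the sketch
below packs a few records with Python's standard ``struct`` module.
The record layout (a 16-byte name, a 32-bit integer and a 64-bit
float) and the sample rows are made up for the example; this is not
how PyTables stores tables internally, only a picture of the idea.

```python
import struct

# Hypothetical layout: 16-byte name, 32-bit integer, 64-bit float.
# The "<" prefix means little-endian with no padding between fields.
RECORD = struct.Struct("<16sid")

rows = [
    (b"sensor-a", 1, 20.5),
    (b"sensor-b", 2, 21.25),
    (b"sensor-c", 3, 19.0),
]

# Every record packs to the same number of bytes, so the buffer is
# simply a flat array of records.
buffer = b"".join(RECORD.pack(*row) for row in rows)


def read_row(index):
    """Fetch one record by computing its byte offset; no scanning needed."""
    name, ident, value = RECORD.unpack_from(buffer, index * RECORD.size)
    return name.rstrip(b"\x00"), ident, value


record_size = RECORD.size
third_row = read_row(2)
```

Because each record has a known size, row ``n`` always starts at byte
``n * record_size``, so any row can be read directly without touching
the ones before it.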


What is HDF5?
=============

For those who know nothing about HDF5, it is a general purpose
library and file format for storing scientific data, developed at NCSA.
HDF5 can store two primary objects: datasets and groups.  A dataset is
essentially a multidimensional array of data elements, and a group is
a structure for organizing objects in an HDF5 file.  Using these two
basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids.  You can also mix and match them in
HDF5 files according to your needs.
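
The two object kinds can be pictured with a toy in-memory model, in
which groups are dictionaries and datasets are nested lists.  The
names used here (``root``, ``lookup``, ``is_group`` and the sample
paths) are hypothetical and exist only for this sketch; no HDF5
library is involved.

```python
# Toy model only: groups are dicts, datasets are nested lists.
root = {
    "images": {                          # a group
        "frame0": [[0, 1], [2, 3]],      # a 2-D dataset
    },
    "grid": {                            # another group
        "vectors": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    },
}


def lookup(tree, path):
    """Resolve a '/'-separated path by walking down through the groups."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


def is_group(node):
    """In this model a group is any dict; everything else is a dataset."""
    return isinstance(node, dict)


frame = lookup(root, "/images/frame0")
grid = lookup(root, "/grid")
```

Groups can contain other groups as well as datasets, which is what
gives an HDF5 file its hierarchical, filesystem-like layout.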


Platforms
=========

We are using Linux on top of Intel32 as the main development platform,
but PyTables should be easy to compile/install on other UNIX machines.
This package has also been successfully compiled and tested on
FreeBSD 5.4 with Opteron64 processors, an UltraSparc platform with
Solaris 7 and Solaris 8, an SGI Origin3000 with Itanium processors
running IRIX 6.5 (using the gcc compiler), Microsoft Windows and
MacOSX (10.2 although 10.3 should work fine as well). In particular,
it has been thoroughly tested on 64-bit platforms, like Linux-64 on
top of an Intel Itanium, AMD Opteron (in 64-bit mode) or PowerPC G5
(in 64-bit mode) where all the tests pass successfully.

Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP (using the Microsoft Visual C compiler), but it
should work with other flavors as well.


Web site
========

Go to the PyTables web site for more details:

http://pytables.sourceforge.net/

To learn more about the company behind the PyTables development, see:

http://www.carabos.com/


Share your experience
=====================

Let us know about any bugs, suggestions, gripes, kudos, etc. you may
have.


----

  **Enjoy data!**

  -- The PyTables Team




