[ANN] PyTables release 0.2

Francesc Alted falted@openlc.org
Tue, 19 Nov 2002 20:34:07 +0100


Announcing PyTables 0.2
-----------------------

What's new
----------

- Numerical Python arrays supported!
- Much improved documentation
- Programming API almost stable
- Improved navigability across the object tree
- Added more unit tests (there are almost 50)
- Dropped HDF5_HL dependency (a tailored version is included in sources now)
- License changed from LGPL to BSD

What it is
----------

The goal of PyTables is to enable the end user to easily manipulate
scientific data tables and Numerical Python objects (new in 0.2!) in
a persistent hierarchical structure. The foundation of the underlying
hierarchical data organization is the excellent HDF5 library
(http://hdf.ncsa.uiuc.edu/HDF5). Right now, PyTables supports only a
limited subset of the HDF5 functions, but I hope to add the more
interesting ones (for PyTables needs) in the near future.
Nonetheless, this package is not intended to serve as a complete
wrapper for the entire HDF5 API.

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type.  The terms
"fixed-length" and strict "data types" may seem quite strange
requirements for an interpreted language like Python, but they serve a
useful function if the goal is to save very large quantities of data
(such as those generated by many scientific applications, for example) in
an efficient manner that reduces demand on CPU time and I/O.
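
To make the "fixed-length" idea concrete, here is a minimal sketch that
uses only the standard Python struct module (not PyTables itself, and
written for Python 3, unlike the example at the bottom of this message).
The field codes are the ones used by the Particle record in that
example. The unpadded little-endian layout ("<") and the helper names
pack_record and unpack_record are assumptions made for illustration; the
actual on-disk HDF5 layout may differ.

```python
import struct

# Field names and format codes borrowed from the Particle record in the
# example at the end of this message.
FIELDS = [
    ("name", "16s"),      # 16-byte string
    ("idnumber", "Q"),    # unsigned 64-bit integer
    ("TDCcount", "B"),    # unsigned byte
    ("ADCcount", "H"),    # unsigned 16-bit integer
    ("grid_i", "i"),      # signed 32-bit integer
    ("grid_j", "i"),      # signed 32-bit integer
    ("pressure", "f"),    # single-precision float
    ("energy", "d"),      # double-precision float
]

# Assumption: "<" selects little-endian standard sizes without padding.
RECORD_FORMAT = "<" + "".join(code for _, code in FIELDS)
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def pack_record(**values):
    """Pack one record into exactly RECORD_SIZE bytes (hypothetical helper)."""
    return struct.pack(RECORD_FORMAT, *(values[name] for name, _ in FIELDS))


def unpack_record(buf):
    """Unpack RECORD_SIZE bytes back into a dict of field values."""
    names = [name for name, _ in FIELDS]
    return dict(zip(names, struct.unpack(RECORD_FORMAT, buf)))


row = pack_record(name=b"Particle:      3", idnumber=3 * 2 ** 34, TDCcount=3,
                  ADCcount=768, grid_i=3, grid_j=7, pressure=9.0, energy=6561.0)
print(RECORD_SIZE, len(row))
```

Because every record occupies the same number of bytes, record number n
starts at byte offset n * RECORD_SIZE, so any record can be located
without scanning the ones before it.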

In order to emulate records (C structs in HDF5) in Python, PyTables
implements a special metaclass that detects errors in field
assignments as well as range overflows. PyTables also provides a
powerful interface to process table data.
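
The kind of checking described above can be sketched roughly as follows.
This is NOT the PyTables implementation: the names RecordMeta and
Record, and the trick of validating values with struct.pack, are
assumptions made purely for illustration, and the code targets Python 3.

```python
import struct


class RecordMeta(type):
    """Hypothetical metaclass: gather struct format codes declared as fields."""

    def __new__(mcls, name, bases, namespace):
        # Treat every public string attribute as a field declaration.
        formats = {key: value for key, value in namespace.items()
                   if not key.startswith("_") and isinstance(value, str)}
        # Drop the declarations from the class body, so that reading a field
        # that was never assigned raises AttributeError instead of returning
        # its format code.
        body = {key: value for key, value in namespace.items()
                if key not in formats}
        body["_formats"] = formats
        return super().__new__(mcls, name, bases, body)


class Record(metaclass=RecordMeta):
    """Base class that validates each assignment against its format code."""

    def __setattr__(self, field, value):
        if field not in self._formats:
            raise AttributeError("no such field: %r" % field)
        try:
            # Packing fails if the value has the wrong type or is out of range.
            struct.pack("<" + self._formats[field], value)
        except (struct.error, OverflowError) as exc:
            raise ValueError("bad value for %r: %s" % (field, exc)) from exc
        object.__setattr__(self, field, value)


class Particle(Record):
    TDCcount = "B"    # unsigned byte
    grid_i = "i"      # integer
    pressure = "f"    # float (single-precision)


particle = Particle()
particle.TDCcount = 255          # fits in an unsigned byte
try:
    particle.TDCcount = 256      # out of range for "B"
except ValueError as exc:
    print("rejected:", exc)
```

One caveat of this sketch: struct silently truncates byte strings that
are too long for an "s" field, so string lengths would need a separate
check.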

Quite a bit of effort has been invested to make browsing the hierarchical
data structure a pleasant experience. PyTables implements just three
(orthogonal) easy-to-use methods for browsing.

What is HDF5?
-------------

For those people who know nothing about HDF5, it is a general
purpose library and file format for storing scientific data, developed
at NCSA. HDF5 can store two primary objects: datasets and groups. A
dataset is essentially a multidimensional array of data elements, and
a group is a structure for organizing objects in an HDF5 file. Using
these two basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids. You can also mix and match them in
HDF5 files according to your needs.
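
As a rough mental model of these two constructs, here is a tiny
in-memory sketch in Python 3. It does not use or imitate the HDF5 API;
the Group and Dataset classes and the lookup function are hypothetical
names invented for this illustration only.

```python
class Dataset:
    """Toy stand-in for an HDF5 dataset: a holder of array-like data."""

    def __init__(self, data):
        self.data = data


class Group:
    """Toy stand-in for an HDF5 group: a container of groups and datasets."""

    def __init__(self):
        self.children = {}

    def create_group(self, name):
        group = self.children[name] = Group()
        return group

    def create_dataset(self, name, data):
        dataset = self.children[name] = Dataset(data)
        return dataset


def lookup(root, path):
    """Resolve a '/'-separated path such as '/columns/pressure'."""
    node = root
    for part in path.strip("/").split("/"):
        if part:  # skip the empty piece produced by the root path "/"
            node = node.children[part]
    return node


root = Group()
columns = root.create_group("columns")
columns.create_dataset("pressure", [[16.0, 25.0], [36.0, 49.0]])
print(lookup(root, "/columns/pressure").data)
```

Groups nest like directories and datasets sit at the leaves like files,
which is why objects can be addressed with paths such as
'/detector/readout'.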

How fast is it?
---------------

Despite being an alpha version with a lot of room for improvement
(it's still CPU bound!), PyTables can read and write tables quite
fast. If you want some (very preliminary) figures (just to get an
idea of the orders of magnitude), on an AMD Athlon@900 it can
currently read from 40000 up to 60000 records/s and write from 5000 up
to 13000 records/s. Raw data speed in read mode ranges from 1 MB/s up
to 2 MB/s, and it drops to the 200 KB/s - 600 KB/s range for writes.

Go to http://pytables.sf.net/bench.html for a somewhat more detailed
description of this small (and synthetic) benchmark.

Anyway, this is only the beginning (premature optimization is the root
of all evil, you know ;-).

Platforms
---------

I'm using Linux as the main development platform, but PyTables should
be easy to compile/install on other UNIX machines. Thanks to Scott
Prater, this package has passed all the tests on an UltraSparc platform
with Solaris 7. It also compiles and passes all the tests on an SGI
Origin2000 with MIPS R12000 processors and running IRIX 6.5.

If you are using Windows and you get the library to work, please let
me know.

An example?
-----------

At the bottom of this message there is some code (less than 100 lines,
and less than half of them real code) that shows the basic
capabilities of PyTables.

Web site
--------

Go to the PyTables web site for more details:

http://pytables.sf.net/

Final note
----------

This is the second alpha release, and probably the last alpha, so
there is still time if you want to suggest an API addition or change,
or the addition of some useful missing capability. Let me know of any
bugs, suggestions, gripes, kudos, etc. you may have.

-- Francesc Alted
falted@openlc.org


*-*-*-**-*-*-**-*-*-**-*-*- Small code example  *-*-*-**-*-*-**-*-*-**-*-*-*

"""Small but almost complete example showing the PyTables mode of use.

As a result of execution, a 'tutorial1.h5' file is created. You can
look at it with any generic HDF5 utility, like h5ls, h5dump or
h5view.

"""


import sys
from Numeric import *
from tables import *


#'-**-**-**-**-**-**- user record definition  -**-**-**-**-**-**-**-'

# Define a user record to characterize some kind of particles
class Particle(IsRecord):
    name        = '16s'  # 16-character String
    idnumber    = 'Q'    # unsigned long long (i.e. 64-bit integer)
    TDCcount    = 'B'    # unsigned byte
    ADCcount    = 'H'    # unsigned short integer
    grid_i      = 'i'    # integer
    grid_j      = 'i'    # integer
    pressure    = 'f'    # float  (single-precision)
    energy      = 'd'    # double (double-precision)

print
print	'-**-**-**-**-**-**- file creation  -**-**-**-**-**-**-**-'

# The name of our HDF5 file
filename = "tutorial1.h5"
    
print "Creating file:", filename

# Open a file in "w"rite mode
h5file = openFile(filename, mode = "w", title = "Test file")

print
print	'-**-**-**-**-**-**- group and table creation  -**-**-**-**-**-**-**-'

# Create a new group under "/" (root)
group = h5file.createGroup("/", 'detector', 'Detector information')
print "Group '/detector' created"

# Create one table on it
table = h5file.createTable(group, 'readout', Particle(), "Readout example")
print "Table '/detector/readout' created"

# Get a shortcut to the record object in table
particle = table.record

# Fill the table with 10 particles
for i in xrange(10):
    # First, assign the values to the Particle record
    particle.name  = 'Particle: %6d' % (i)
    particle.TDCcount = i % 256    
    particle.ADCcount = (i * 256) % (1 << 16)
    particle.grid_i = i 
    particle.grid_j = 10 - i
    particle.pressure = float(i*i)
    particle.energy = float(particle.pressure ** 4)
    particle.idnumber = i * (2 ** 34)  # Exceeds 32-bit integer range for i > 0
    # Insert a new particle record
    table.appendAsRecord(particle)      

# Flush the buffers for table
table.flush()

print
print	'-**-**-**-**-**-**- table data reading & selection  -**-**-**-**-**-'

# Read actual data from table. We are interested in collecting pressure values
# on entries where TDCcount field is greater than 3 and pressure less than 50
pressure = [ x.pressure for x in table.readAsRecords()
             if x.TDCcount > 3 and x.pressure < 50 ]
print "Last record read:"
print x
print "Field pressure elements satisfying the cuts ==>", pressure

# Read also the names with the same cuts
names = [ x.name for x in table.readAsRecords()
          if x.TDCcount > 3 and x.pressure < 50 ]

print
print	'-**-**-**-**-**-**- array object creation  -**-**-**-**-**-**-**-'

print "Creating a new group called '/columns' to hold new arrays"
gcolumns = h5file.createGroup(h5file.root, "columns", "Pressure and Name")

print "Creating a Numeric array called 'pressure' under '/columns' group"
h5file.createArray(gcolumns, 'pressure', array(pressure), 
                   "Pressure column selection")

print "Creating another Numeric array called 'name' under '/columns' group"
h5file.createArray('/columns', 'name', array(names),
                   "Name column selection")

# Close the file
h5file.close()
print "File '"+filename+"' created"