ANN: PyTables 0.7 released
Francesc Alted
falted@openlc.org
Fri, 1 Aug 2003 01:20:56 +0200
Announcing PyTables 0.7
-----------------------
PyTables is a hierarchical database package designed to efficiently
manage very large amounts of data. PyTables is built on top of the
HDF5 library and the numarray package and features an object-oriented
interface that, combined with C code generated from Pyrex sources,
makes it a fast yet extremely easy-to-use tool for interactively
saving and retrieving large amounts of data.
Release 0.7 is the third public beta release. Version 0.6 was
internal and will never be released.
In this release you will find:
- new AttributeSet class
- 25% I/O speed improvement
- full support for multidimensional table cells
- new column descriptors
- row deletion in tables is finally here
- much more!
In more detail:
What's new
-----------
- A new AttributeSet class has been added. It lets you add and
delete generic attributes (any scalar type plus any Python object
supported by Pickle) as easily as this:
table.attrs.date = "2003/07/28 10:32" # Attach a string to table
group._v_attrs.tempShift = 1.2 # Attach a float to group
array.attrs.detectorList = [1,2,3,4] # Attach a list to array
del array.attrs.detectorList # Detach detectorList attr from array
- PyTables now fully supports multidimensional table cells. This
has been made possible in part by the implementation of
multidimensional cells in the numarray.records.RecArray object.
Thanks to the numarray crew, and especially to Jin-chung Hsu, for
willingly agreeing to do that, and also for including some cache
improvements in RecArray.
- New column descriptors have been added: IntCol, Int8Col, UInt8Col,
Int16Col, UInt16Col, Int32Col, UInt32Col, Int64Col, UInt64Col,
FloatCol, Float32Col, Float64Col and StringCol. I think they are more
explicit and easier to use than the now deprecated (but still
supported) Col() descriptor. All the examples and the user's manual
have been updated accordingly.
- The new Table.removeRows(start, stop) method allows you to remove
rows from tables. This feature was requested a long time ago. There
are still limitations, however: you cannot delete rows in extremely
large tables, because the rows that remain after the stop parameter
are held in memory, and performance is not yet optimized. These
issues will hopefully be addressed in future releases.
- Added iterators to File, Group and Table (they now support the special
__iter__() method). They make these objects much more user-friendly,
especially in interactive mode. See the documentation for usage examples.
- Added a __getitem__() method to Table that works more or less like
read(), but with support for extended slices.
- As a consequence of rewriting the table iterators in C (with the help
of Pyrex, of course), table read performance has improved by 20% to
30%. Data selections in PyTables are now starting to beat powerful
relational databases like SQLite, even compared to in-core selects
(!). I think there is still room for another 20% or 30% speed
improvement, so stay tuned.
- A checksum is now added automatically when using LZO (but not with
UCL, where I'm having some difficulties implementing that
capability). The Adler32 algorithm has been chosen because of its
speed. With it, compression/decompression speed has dropped by 1%
or 2%, which is hardly noticeable. I think this addition will allow
the cautious user to be a bit more confident about this excellent
compressor. Code has been added to read files created without this
checksum (so you can be confident that you will be able to read your
existing files compressed with LZO and UCL).
- Recursion has been removed from PyTables. Before, it limited the
maximum tree depth to less than the Python recursion limit (which
depends on the implementation, but is around 900, at least on
Linux). Now, the limit has been set (somewhat arbitrarily) at
2048. Thanks to John Nielsen for implementing the new iterative
method!
- A new rootUEP parameter to openFile() has been added. You can now
define the root from which you want to start to build the object tree.
Thanks to John Nielsen for the suggestion and a first implementation.
- A small bug has been fixed that, when dealing with non-native
PyTables files, prevented the use of the "classname" filter during a
listNodes() call. Thanks to Jeff Robbins for reporting that.
- Some (non-serious) bugs were discovered and fixed.
- Updated documentation to explain all these new bells and whistles. It
is also available on the web:
http://pytables.sourceforge.net/html-doc/usersguide-html.html
- Added more unit tests (more than 350 now!)
- PyTables 0.7 *needs* numarray 0.6 or higher and HDF5 1.6.0 or higher
to compile and work. It has been tested with Python 2.2 and 2.3 and
should work fine on both versions.
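To make the checksum item above more concrete, here is a minimal
sketch of the general idea: store an Adler-32 checksum next to the
compressed data and verify it when reading. This is not the PyTables
or LZO code. It uses the zlib module from the Python standard library
as a stand-in compressor, and the pack() and unpack() helpers are
hypothetical names invented for this illustration.

```python
import zlib

def pack(payload: bytes) -> bytes:
    """Compress payload and append a 4-byte Adler-32 of the original data."""
    checksum = zlib.adler32(payload) & 0xFFFFFFFF
    # zlib is only a stand-in here; PyTables applies the checksum to LZO data.
    return zlib.compress(payload) + checksum.to_bytes(4, "big")

def unpack(blob: bytes) -> bytes:
    """Decompress a blob produced by pack() and verify its checksum."""
    body, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    payload = zlib.decompress(body)
    if zlib.adler32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("Adler-32 mismatch: data is corrupted")
    return payload

data = b"PyTables " * 1000
blob = pack(data)
restored = unpack(blob)  # raises ValueError if the checksum does not match
```

The checksum costs only four extra bytes per block plus one pass over
the data, which fits the small slowdown mentioned above.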
What is a table?
----------------
A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type. "Fixed-length" and
"strict data types" may seem quite strange requirements for a
language like Python, which supports dynamic data types, but they
serve a useful function if the goal is to save very
large quantities of data (such as is generated by many scientific
applications, for example) in an efficient manner that reduces demand
on CPU time and I/O resources.
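As an illustration of why fixed-length fields help, here is a small
sketch that uses only the struct module from the Python standard
library, not PyTables itself. The record layout (a 16-byte name, a
32-bit integer and a 64-bit float) and the read_row() helper are
assumptions made up for this example.

```python
import struct

# Hypothetical record layout: 16-byte name, 32-bit int, 64-bit float.
# "<" means little-endian with no padding, so every record is 28 bytes.
RECORD = struct.Struct("<16sid")

rows = [
    (b"detector-a", 1, 1.5),
    (b"detector-b", 2, 2.5),
    (b"detector-c", 3, 3.5),
]
buffer = b"".join(RECORD.pack(*row) for row in rows)

def read_row(buf: bytes, n: int):
    """Fetch row n directly: its offset is simply n * RECORD.size."""
    name, ident, value = RECORD.unpack_from(buf, n * RECORD.size)
    # Short names are padded with NUL bytes to fill the fixed-length field.
    return name.rstrip(b"\x00"), ident, value

third = read_row(buffer, 2)
```

Because every record has the same size, any row can be located with a
single multiplication, without scanning or parsing the rows before it.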
What is HDF5?
-------------
For those people who know nothing about HDF5, it is a general
purpose library and file format for storing scientific data, developed
at NCSA. HDF5 can store two primary objects: datasets and groups. A
dataset is essentially a multidimensional array of data elements, and
a group is a structure for organizing objects in an HDF5 file. Using
these two basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids. You can also mix and match them in
HDF5 files according to your needs.
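The two constructs can be pictured with a toy in-memory model. The
sketch below is plain Python and does not use the HDF5 library or the
PyTables API; the Dataset and Group classes and the walk() function
are hypothetical names chosen for this illustration. The traversal
uses an explicit stack rather than recursion, so a deep hierarchy is
not bounded by the Python recursion limit.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple, Union

@dataclass
class Dataset:
    """A multidimensional array, stored here as a shape plus flat data."""
    shape: Tuple[int, ...]
    data: List[float]

@dataclass
class Group:
    """A container that organizes datasets and other groups by name."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)

def walk(root: Group) -> Iterator[Tuple[str, Union[Group, Dataset]]]:
    """Yield (path, node) pairs depth-first using an explicit stack."""
    stack = [("/", root)]
    while stack:
        path, node = stack.pop()
        yield path, node
        if isinstance(node, Group):
            # Push in reverse order so children are visited alphabetically.
            for name in sorted(node.children, reverse=True):
                child_path = path.rstrip("/") + "/" + name
                stack.append((child_path, node.children[name]))

root = Group()
root.children["images"] = Group({"frame0": Dataset((2, 2), [0.0, 1.0, 2.0, 3.0])})
root.children["vectors"] = Dataset((3,), [1.0, 2.0, 3.0])

paths = [path for path, _ in walk(root)]
```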
Platforms
---------
I'm using Linux as the main development platform, but PyTables should
be easy to compile/install on other UNIX machines. This package has
also passed all the tests on an UltraSparc platform with Solaris 7 and
Solaris 8. It also compiles and passes all the tests on an SGI
Origin2000 with MIPS R12000 processors running IRIX 6.5.
Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP, but it should also work with other flavors.
An example?
-----------
For online code examples, have a look at
http://pytables.sourceforge.net/tut/tutorial1-1.html
and
http://pytables.sourceforge.net/tut/tutorial1-2.html
Web site
--------
Go to the PyTables web site for more details:
http://pytables.sourceforge.net/
Share your experience
---------------------
Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.
Have fun!
-- Francesc Alted
falted@openlc.org