PyTables is a library for managing hierarchical datasets, designed to cope efficiently with extremely large amounts of data, with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and the NumPy package to achieve maximum throughput and convenient use.
This is the second (and probably last) release candidate for PyTables 2.0. In it, along with the traditional bunch of bug fixes, you will find a handful of optimizations for dealing with very large tables. Also, the "Optimization tips" chapter of the User's Guide has been updated and the manual is almost ready (bar some errors or typos we may have introduced) for the long-awaited 2.0 final release. In particular, the "Indexed searches" section shows pretty definitive plots on the performance of the completely new and innovative indexing engine that will be available in the Pro version (to be released very soon now).
You can download a source package of version 2.0rc2, with generated PDF and HTML docs, as well as binaries for Windows, from http://www.pytables.org/download/preliminary/
For an on-line version of the manual, visit: http://www.pytables.org/docs/manual-2.0rc2
For more detail on what has changed in this
version, have a look at
RELEASE_NOTES.txt. Find the HTML version
for this document at:
If you are a user of PyTables 1.x, it is probably worth looking at the
MIGRATING_TO_2.x.txt file, where you will find directions on how
to migrate your existing PyTables 1.x apps to the 2.0 version. You can
find an HTML version of this document at
Keep reading for an overview of the most prominent improvements in PyTables 2.0 series.
A complete refactoring of many, many modules in PyTables. With this, the different parts of the code are much better integrated and code redundancy is kept to a minimum. A lot of new optimizations have been included as well, making working with it a smoother experience than ever before.
NumPy is finally at the core! That means that PyTables no longer needs numarray in order to operate, although it continues to be supported (as well as Numeric). This also means that you should be able to run PyTables in scenarios combining Python 2.5 and 64-bit platforms (these are a source of problems with numarray/Numeric because they don't support this combination as of this writing).
Most of the operations in PyTables have experienced noticeable speed-ups (sometimes up to 2x, as in regular Python table selections). This is a consequence of both using NumPy internally and a considerable effort in refactoring and optimizing the new code.
Combined conditions are finally supported for in-kernel selections. So, now it is possible to perform complex selections like::
result = [ row['var3'] for row in table.where('(var2 < 20) | (var1 == "sas")') ]
complex_cond = '((%s <= col5) & (col2 <= %s)) ' \
               '| (sqrt(col1 + 3.1*col2 + col3*col4) > 3)'
result = [ row['var3'] for row in table.where(complex_cond % (inf, sup)) ]
and run them at full C-speed (or perhaps more, due to the cache-tuned computing kernel of Numexpr, which has been integrated into PyTables).
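To make the meaning of the second condition concrete, here is a pure-Python sketch that evaluates the same predicate over a few in-memory rows. It does not use PyTables: the sample rows, the `inf`/`sup` values and the `matches` helper are all invented for illustration, and the real `table.where()` evaluates the condition string in-kernel through Numexpr instead of row by row in Python.

```python
from math import sqrt

# Hypothetical sample rows; in PyTables these would be columns of a Table.
rows = [
    {"col1": 1.0, "col2": 2.0, "col3": 0.5, "col4": 2.0, "col5": 4.0, "var3": "a"},
    {"col1": 0.0, "col2": 0.1, "col3": 0.0, "col4": 0.0, "col5": 0.0, "var3": "b"},
    {"col1": 0.0, "col2": 9.0, "col3": 0.0, "col4": 0.0, "col5": 0.0, "var3": "c"},
]
inf, sup = 3, 5  # illustrative bounds, not taken from the announcement

# The condition string is plain text; the bounds are substituted with %.
complex_cond = ('((%s <= col5) & (col2 <= %s)) '
                '| (sqrt(col1 + 3.1*col2 + col3*col4) > 3)')
cond_text = complex_cond % (inf, sup)


def matches(row, inf, sup):
    """Evaluate the same predicate as cond_text, one row at a time."""
    in_range = (inf <= row["col5"]) and (row["col2"] <= sup)
    big_root = sqrt(row["col1"] + 3.1 * row["col2"]
                    + row["col3"] * row["col4"]) > 3
    return in_range or big_root


result = [row["var3"] for row in rows if matches(row, inf, sup)]
print(cond_text)
print(result)
```

The first row matches through the range test, the third through the square-root test, and the second matches neither.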
Now, it is possible to get fields of the
Row iterator by
specifying their position, or even ranges of positions (extended
slicing is supported). For example, you can do::
result = [ row[4] for row in table    # fetch field #4
           if row[1] < 20 ]
result = [ row[:] for row in table    # fetch all fields
           if row['var2'] < 20 ]
result = [ row[1::2] for row in       # fetch odd fields
           table.iterrows(2, 3000, 3) ]
in addition to the classical::
result = [row['var3'] for row in table.where('var2 < 20')]
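Positional access and extended slicing follow the usual Python sequence rules. The sketch below uses a plain tuple as a stand-in for the fields of one row (the values are made up), and assumes that a (start, stop, step) iteration visits the same row numbers as Python's `range()`; it does not call PyTables.

```python
# A plain tuple standing in for the fields of one row (values are made up).
fields = ("id-0", 17, 3.5, "sas", 42, 0.25)

field_4 = fields[4]         # a single field, selected by position
all_fields = fields[:]      # every field
odd_fields = fields[1::2]   # fields at positions 1, 3 and 5

# Row numbers visited by a (start, stop, step) = (2, 3000, 3) iteration,
# assuming the same semantics as Python's range().
visited = range(2, 3000, 3)
first_rows = list(visited[:4])

print(field_4, odd_fields, first_rows, len(visited))
```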
Row has received a new method called fetch_all_fields() in
order to easily retrieve all the fields of a row in situations like::
[row.fetch_all_fields() for row in table.where('column1 < 0.3')]
The difference between row[:] and row.fetch_all_fields() is
that the former will return all the fields as a tuple, while the
latter will return the fields in a NumPy void type and should be
faster. Choose whichever better fits your needs.
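For readers unfamiliar with the NumPy void type, the following sketch shows the two shapes of result using NumPy alone. It assumes NumPy is installed, the array contents and field names are invented, and it does not call PyTables: a record taken from a structured array plays the role of the void object, and its `item()` method gives the equivalent tuple.

```python
import numpy as np

# Hypothetical two-row structured array standing in for table data.
data = np.array(
    [(1, 0.1, b"sas"), (2, 0.7, b"abc")],
    dtype=[("var1", "i4"), ("column1", "f8"), ("var3", "S3")],
)

record = data[0]           # a NumPy void scalar holding one whole record
as_tuple = record.item()   # the same fields as a plain Python tuple

print(type(record).__name__, record["var3"])
print(type(as_tuple).__name__, as_tuple[2])
```

The void record keeps the field names, so fields can still be looked up by name, whereas the tuple only supports positions.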
Now, all data that is read from disk is converted, if necessary, to
the native byteorder of the hosting machine (before, this only
happened with Table objects). This should help to accelerate
applications that have to do computations with data generated on
platforms with a byteorder different from that of the user's machine.
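As a rough illustration of what a byteorder conversion involves, the sketch below uses only Python's standard `struct` module, not PyTables. The value 1025 is an arbitrary example; the point is that the same four bytes decode to different integers depending on the byteorder assumed by the reader.

```python
import struct
import sys

value = 1025  # 0x00000401, an arbitrary example value

big = struct.pack(">i", value)      # bytes as a big-endian machine stores them
little = struct.pack("<i", value)   # bytes as a little-endian machine stores them

# Decoding with the byteorder the data was written in recovers the value.
from_big = struct.unpack(">i", big)[0]

# Decoding big-endian bytes as little-endian silently gives a wrong number.
misread = struct.unpack("<i", big)[0]

# "=" selects the native byteorder of the machine running this code.
native = struct.pack("=i", value)

print(sys.byteorder, big, little, from_big, misread)
```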
The modification of values in
*Array objects (through __setitem__)
now doesn't make a copy of the value in the case that the shape of the
value passed is the same as the slice to be overwritten. This results
in considerable memory savings when you are modifying disk objects
with big array values.
All leaf constructors (except for
Array) have received a new
chunkshape argument that lets the user explicitly select the
chunksizes for the underlying HDF5 datasets (only for advanced users).
All leaf constructors have received a new parameter called
byteorder that lets the user specify the byteorder of their data
on disk. This effectively allows creating datasets in byteorders
other than that of the native platform.
Native HDF5 datasets with
H5T_ARRAY datatypes are fully supported
for reading now.
The test suites for the different packages are installed now, so you don't need a copy of the PyTables sources to run the tests. Besides, you can run the test suite from the Python console by using::
Go to the PyTables web site for more details:
About the HDF5 library:
To know more about the company behind the development of PyTables, see:
Thanks to the many users who provided feature improvements, patches, bug
reports, support and suggestions. See the
THANKS file in the
distribution package for an (incomplete) list of contributors. Many
thanks also to SourceForge, who have helped to make and distribute this
package! And last, but not least, thanks a lot to the HDF5 and NumPy
(and numarray!) makers. Without them PyTables simply would not exist.
Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.
-- The PyTables Team
--
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works,
www.carabos.com   |  I haven't tested it. -- Donald Knuth