Announcing PyTables 0.9.1
This release is mainly a maintenance version. Some bugs have
been fixed and a few improvements have been made. One important change
is that chunk sizes in EArrays have been re-tuned to get much better
performance and compression ratios. In addition, it has been tested
against the latest Python 2.4 and all unit tests seem to pass fine.
What it is
PyTables is a solid hierarchical database package designed to
efficiently manage extremely large amounts of data (with support for
full 64-bit file addressing). It features an object-oriented interface
that, combined with C extensions for the performance-critical parts of
the code, makes it a very easy-to-use tool for high performance data
storage and retrieval.
It is built on top of the HDF5 library and the numarray package, and
provides containers for both heterogeneous data (Tables) and
homogeneous data (Array, EArray) as well as containers for keeping
lists of objects of variable length (like Unicode strings or general
Python objects) in a very efficient way (VLArray). It also sports a
series of filters allowing you to compress your data on-the-fly by
using different compressors and compression enablers.
But perhaps the most interesting features are its powerful browsing
and searching capabilities, which allow you to perform data selections
over heterogeneous datasets exceeding gigabytes of data in just tenths
of a second. Moreover, all PyTables I/O is buffered, implemented in C
and carefully tuned, so you can reach much better performance with
PyTables than with your own home-grown wrappings to the HDF5 library.
Changes in more depth
- The chunksize computation for EArrays has been re-tuned to allow
better performance and *much* better compression ratios.
- New --unpackshort and --quantize flags have been added to the nctoh5
script. --unpackshort unpacks short integer variables to float
variables using the scale_factor and add_offset netCDF variable
attributes. --quantize quantizes data to improve compression using
the least_significant_digit netCDF variable attribute (not active by
default); see the netCDF documentation for further explanation of what
this attribute means. Thanks to Jeff Whitaker for providing this.
- Table.itersequence has received a new parameter called "sort". This
allows the user to disable the sorting of the sequence if so desired.
- Now, the AttributeSet class throws an AttributeError on __getattr__
for nonexistent attributes. Formerly, the routine returned
None, which is pretty much against convention in Python and breaks
the built-in hasattr() function. Thanks to Norbert Nemec for noting
this and offering a patch.
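As context for why returning None breaks hasattr(), here is a minimal
plain-Python sketch (hypothetical classes, not PyTables code):

```python
class Broken:
    """Mimics the old behaviour: unknown lookups return None."""
    def __getattr__(self, name):
        return None

class Fixed:
    """Follows Python convention: unknown lookups raise AttributeError."""
    def __getattr__(self, name):
        raise AttributeError(name)

# hasattr() works by catching AttributeError, so Broken lies:
print(hasattr(Broken(), "no_such_attr"))  # True, misleadingly
print(hasattr(Fixed(), "no_such_attr"))   # False, as expected
```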
- VLArray.read() has changed its behaviour. Now, it always returns a
list, as stated in the documentation, even when the number of elements
to return is 0 or 1. This is much more consistent when representing
the actual number of elements in a certain VLArray row.
- A Row.getTable() has been added. It is an accessor for the associated
Table object.
- A File.copyAttrs() has been added. It allows copying attributes from
one leaf to another. Properly speaking, this was already there, but it
was not directly exposed.
- Now, the copy of hierarchies works even when there are scalar Arrays
(i.e. Arrays whose shape is ()) in them. Thanks to Norbert Nemec for
providing a patch.
- Solved a memory leak in the Filters instance associated with
the File object, which was not released after closing the file. Now,
there are no known leaks in PyTables itself.
- Fixed a bug in Table.append() when the table was indexed. The problem
was that if the table was in auto-indexing mode, some rows were lost
during the indexing process and hence not indexed correctly.
- Improved security of node name checking. Closes #1074335.
Important note for Python 2.4 and Windows users
If you want to use PyTables with Python 2.4 on Windows
platforms, you will need to get the HDF5 library compiled for MSVC
7.1, aka .NET (and possibly LZO and UCL as well, if you want support
for LZO and UCL at all). It can be found at:
Where can PyTables be applied?
PyTables is not designed to work as a relational database competitor,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your
cluttered RDBMS, then give PyTables a try. It works well for storing
data from data acquisition systems (DAS), simulation software, network
data monitoring systems (for example, traffic measurements of IP
packets on routers), very large XML files, or for creating a
centralized repository for system logs, to name only a few possible
uses.
What is a table?
A table is defined as a collection of records whose values are stored in
fixed-length fields. All records have the same structure and all values
in each field have the same data type. The terms "fixed-length" and
"strict data types" seem to be quite a strange requirement for a
language like Python that supports dynamic data types, but they serve a
useful function if the goal is to save very large quantities of data
(such as that generated by many scientific applications) in
an efficient manner that reduces demand on CPU time and I/O resources.
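The fixed-length record model described above can be sketched with
Python's standard struct module (this is only an illustration of the
storage idea, not PyTables' actual API; the field layout is made up):

```python
import struct

# One record layout shared by every row: a 16-byte name field,
# a 4-byte signed int and an 8-byte float ("<" means no padding).
record = struct.Struct("<16sid")

rows = [(b"particle-1", 10, 3.5),
        (b"particle-2", 20, 7.25)]
packed = b"".join(record.pack(*r) for r in rows)

# Fixed-length records mean row i starts at byte i * record.size,
# so any row can be fetched directly without scanning the file.
name, count, value = record.unpack_from(packed, 1 * record.size)
print(record.size, count, value)  # 28 20 7.25
```

Because every row has the same size and field types, seeking to a row
is a single multiplication, which is what makes this layout cheap on
CPU time and I/O.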
What is HDF5?
For those people who know nothing about HDF5, it is a general-purpose
library and file format for storing scientific data, created at NCSA. HDF5
can store two primary objects: datasets and groups. A dataset is
essentially a multidimensional array of data elements, and a group is a
structure for organizing objects in an HDF5 file. Using these two basic
constructs, one can create and store almost any kind of scientific data
structure, such as images, arrays of vectors, and structured and
unstructured grids. You can also mix and match them in HDF5 files
according to your needs.
I'm using Linux (Intel 32-bit) as the main development platform, but
PyTables should be easy to compile/install on many other UNIX
machines. This package has also passed all the tests on an UltraSPARC
platform with Solaris 7 and Solaris 8. It also compiles and passes all
the tests on an SGI Origin2000 with MIPS R12000 processors, with the
MIPSPro compiler and running IRIX 6.5. It also runs fine on Linux
64-bit platforms, like an AMD Opteron running GNU/Linux 2.4.21 Server,
Intel Itanium (IA64) running GNU/Linux 2.4.21, or a PowerPC G5 with Linux
2.6.x in 64-bit mode. It has also been tested on Mac OS X platforms
(10.2, but it should also work on newer versions).
Regarding Windows platforms, PyTables has been tested with Windows
2000 and Windows XP (using the Microsoft Visual C compiler), but it
should work with other flavors as well.
Go to the PyTables web site for more details:
To know more about the company behind the PyTables development, see:
Share your experience
Let me know of any bugs, suggestions, gripes, kudos, etc. you may have.
Who's your data daddy? PyTables
Once again I forgot to reply-all...
Florian apparently figured out whatever flaws were in my original
message attached here. One thing people interested in this kind of data
aliasing should keep in mind is that all of the hidden variables I
mentioned (in the attached) exist to make things like array slices and
record arrays work properly. Just reusing the _data, by itself, is not
sufficient to work *in general* because not all of the _data is always
used. It does work for the relatively simple case of a new array, but
might fail if the array came from a field of a recarray, a numerical
array slice, a transposed numerical array, etc. Be careful.
I would like to be able to access the same array (memory location) with
arrays of different size and with different typecodes. Let's say I have an
array of 8 UInt8 and I want to view it as 2 UInt32. I want to be able to
change the content of either array and the change should be visible in
both arrays. To speak in C notation, I want a *UInt8 and a *UInt32
pointing to the same memory location.
Is that possible with Numeric or numarray, maybe even with slices of
these arrays?
The reason I want this is that I want to avoid copying memory around. It
would be even cooler if this worked with mmapped arrays, though it
would be enough if it worked with read-only mmaps. BTW, why isn't it
allowed to create overlapping mmap slices?
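Leaving Numeric/numarray specifics aside, the kind of aliasing being
asked about can be illustrated in plain Python with memoryview.cast,
which reinterprets one buffer at a different item size without copying
(item sizes and byte order here are those of the running interpreter):

```python
import array

# Eight unsigned bytes in one buffer.
buf = array.array("B", [1, 0, 0, 0, 2, 0, 0, 0])

# Alias the same memory as two 32-bit unsigned ints ("I" is
# 4 bytes on common platforms; values shown assume little-endian).
words = memoryview(buf).cast("I")
print(list(words))  # [1, 2] on a little-endian machine

# A write through one view is visible through the other.
words[0] = 0xFFFFFFFF
print(buf[0], buf[3])  # 255 255
```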
How can I control my output presentation? There are
times when using print array_str(a, suppress_small=1)
does not print a float as intended, but still outputs it
in exponential form. I am using a win32 OS and
mostly using the PythonWin shell, but I also tried
wxPython's shell with the same results. I tried several
other settings, such as sys.float_output_suppress_small,
to no avail.
Another question I have is: performing an operation on
an array produces a new array index subscript. Is it
possible to re-assign the original index subscript to
an array after performing the operation? (tunnel
Thank you very much,
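For what it's worth, NumPy (the successor of Numeric/numarray) keeps
the same suppress_small knob, so, assuming that is the effect wanted,
it can be demonstrated like this:

```python
import numpy as np  # modern successor of Numeric/numarray

a = np.array([1.0, 2.5e-10, 3.0])

# Without suppression, one tiny element forces exponential form
# for the whole array:
print(np.array_str(a))

# suppress_small=True prints near-zero elements as 0. instead,
# so the other values come out in plain fixed notation:
print(np.array_str(a, suppress_small=True))
```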
I have some trouble with the new numarray library.
Most of the time the numarray code prints out some negative value,
while the Numeric code yields positive values.
This behaviour just shows up when using large arrays.
Can anybody point out what I need to change in the numarray version to
get the Numeric behaviour?
System: WinXP, Python 2.3.4, Numeric 23.5, numarray 1.1.1
The numarray version:
from numarray import greater, reshape, trace
from numarray.random_array import standard_normal
A = greater(standard_normal(n*n), 0.9)
A = reshape(A, (n, n))
The Numeric version:
from Numeric import greater, reshape, trace
from RandomArray import standard_normal
A = greater(standard_normal(n*n), 0.9)
A = reshape(A, (n, n))
Thank you for your help.
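The code that actually produced the printed number is not shown above,
but a likely cause (an assumption on my part) is integer overflow:
numarray comparison results are stored in a narrow integer type, and
accumulating many such elements at that same width can wrap around to
a negative value, whereas Numeric used a wider platform int. A minimal
NumPy sketch of the wrap-around effect:

```python
import numpy as np

# A mask of 200 ones stored in a narrow signed type.
mask = np.ones(200, dtype=np.int8)

# Accumulating in int8 wraps past 127: 200 becomes 200 - 256 = -56.
narrow = mask.sum(dtype=np.int8)

# Accumulating in a wide type gives the expected count.
wide = mask.sum(dtype=np.int64)
print(narrow, wide)  # -56 200
```

If this is indeed the cause, widening the array before accumulating
(e.g. with astype) should restore the Numeric behaviour.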