[Numpy-discussion] Data file format choice.

Fri Jan 30 14:42:23 EST 2009

On Friday 30 January 2009, Jeff Whitaker wrote:
> Gary Pajer wrote:
> > It's time for me to select a data format.
> >
> > My data are (more or less) spectra ( a couple of thousand samples),
> > six channels, each channel running around 10 Hz, collecting for a
> > minute or so. Plus all the settings on the instrument.
> >
> > I don't see any significant differences between netCDF4 and HDF5.
>
> Gary:  netCDF4 is just a thin wrapper on top of HDF5 1.8 - think of
> it as a higher level API.
>
> > Similarly, I don't see significant differences between pytables and
> > h5py.  Does one play better with numpy?
>
> pytables has been around longer and is well-tested, has nice pythonic
> features, but files you write with it may not be readable by C or
> fortran clients.

Just to be clear: PyTables writes pickled objects to the file only if
they cannot reasonably be represented as native HDF5 objects.  If you
save NumPy objects or regular Python scalars, they are written as
native HDF5 objects (see [1]).

Regarding a comparison with h5py (disclaimer: I'm the main author of
PyTables), I'd say that h5py is designed to map directly onto NumPy
array capabilities, but doesn't try to go further.  It is also worth
noting that h5py offers access to the low-level HDF5 functions, which
is useful if you want to dig deeper into HDF5 intricacies.
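
As an illustration of that NumPy-style mapping, here is a minimal
sketch of storing a run like Gary's with h5py.  The dataset name, the
attribute names and the array shape are assumptions made up for the
example (a shortened run of 50 spectra per channel), and the file is
kept in memory with the "core" driver so nothing is written to disk.

```python
import numpy as np
import h5py


def store_run_in_memory():
    """Write a hypothetical run to an in-memory HDF5 file and read it back."""
    # Assumed layout: 6 channels x 50 spectra x 2000 samples (made-up sizes).
    n_channels, n_spectra, n_samples = 6, 50, 2000
    rng = np.random.default_rng(0)
    spectra = rng.random((n_channels, n_spectra, n_samples), dtype=np.float32)

    # driver="core" with backing_store=False keeps the file purely in memory.
    with h5py.File("run.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("spectra", data=spectra)

        # Instrument settings can travel with the data as HDF5 attributes.
        # These attribute names are hypothetical.
        dset.attrs["sample_rate_hz"] = 10.0
        dset.attrs["instrument"] = "hypothetical spectrometer"

        # Slicing the dataset looks like slicing a NumPy array.
        one_spectrum = dset[2, 10, :]
        shape = dset.shape
        rate = float(dset.attrs["sample_rate_hz"])

    roundtrip_ok = bool(np.array_equal(one_spectrum, spectra[2, 10, :]))
    return shape, rate, roundtrip_ok
```

Calling store_run_in_memory() returns the dataset shape, the stored
sample rate and whether the slice read back matches the original.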

For its part, PyTables doesn't try to go that low-level and, besides
supporting general NumPy objects, it focuses on implementing
advanced features that are normally only available in database-oriented
approaches, like enumerated types, flexible query iterators for tables
(the on-disk equivalent of recarrays), indexing (Pro version only),
do/undo features and natural naming (for an enhanced interactive
experience).  PyTables also tries hard to be a high-performance
interface to HDF5, implementing niceties like internal LRU caches for
nodes, automatic chunk sizes for datasets and internal use of numexpr
to accelerate queries as much as possible.
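
To give an idea of what a query iterator on a table looks like, here is
a minimal sketch.  The table layout (channel, t, peak) and the
condition are invented for the example, and it uses the snake_case
names of current PyTables releases (open_file, create_table), which
may differ from the names in older versions.  The file is kept in
memory through the H5FD_CORE driver.

```python
import tables


class Reading(tables.IsDescription):
    """Hypothetical table layout: one row per acquired spectrum."""
    channel = tables.Int32Col()    # channel number, 0..5
    t = tables.Float64Col()        # acquisition time in seconds
    peak = tables.Float64Col()     # some per-spectrum summary value


def query_demo():
    """Fill a small in-memory table and select rows with a condition."""
    # H5FD_CORE with backing store disabled keeps the file in memory only.
    h5 = tables.open_file("demo.h5", "w", driver="H5FD_CORE",
                          driver_core_backing_store=0)
    try:
        table = h5.create_table("/", "readings", Reading)
        row = table.row
        for i in range(60):
            row["channel"] = i % 6
            row["t"] = float(i)
            row["peak"] = float(i)
            row.append()
        table.flush()

        # The condition is a string evaluated over the table's columns;
        # where() yields only the rows that satisfy it.
        hits = [r["peak"] for r in table.where("(channel == 2) & (t > 30)")]
    finally:
        h5.close()
    return hits
```

With natural naming, the same table is also reachable as
h5.root.readings while the file is open.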

Finally, although h5py is relatively recent, I'm really impressed by
the work that Andrew has already done, and in fact I'm looking forward
to backporting some of the h5py features (like general NumPy-like fancy
indexing capabilities) to PyTables.  At any rate, it is clear that
having both h5py and PyTables will benefit users, the only drawback
being that they have to choose their preferred API to HDF5 (or they can
use both, which could be really a lot of fun ;-).  NetCDF4-based
interfaces are also probably a good approach and, as netCDF4 is built
on HDF5, compatibility is ensured.

HTH,

[1] http://www.pytables.org/docs/manual/ch04.html#id2553542

-- 
Francesc Alted