Data file format choice.
It's time for me to select a data format.
My data are (more or less) spectra (a couple of thousand samples), six channels, each channel running around 10 Hz, collecting for a minute or so. Plus all the settings on the instrument.
I don't see any significant differences between netCDF4 and HDF5. Similarly, I don't see significant differences between pytables and h5py. Does one play better with numpy? What are the best numpy solutions for netCDF4?
Can anyone provide thoughts, pros and cons, etc, that I can mull over?
-gary
Gary Pajer wrote:
It's time for me to select a data format.
My data are (more or less) spectra (a couple of thousand samples), six channels, each channel running around 10 Hz, collecting for a minute or so. Plus all the settings on the instrument.
I don't see any significant differences between netCDF4 and HDF5.
Gary: netCDF4 is just a thin wrapper on top of HDF5 1.8 - think of it as a higher level API.
Similarly, I don't see significant differences between pytables and h5py. Does one play better with numpy?
pytables has been around longer and is well-tested, has nice pythonic features, but files you write with it may not be readable by C or fortran clients. h5py works only with python 2.5/2.6, and writes 'vanilla' hdf5 files readable by anybody.
What are the best numpy solutions for netCDF4?
There's only one that I know of - http://code.google.com/p/netcdf4-python.
-Jeff
-- Jeffrey S. Whitaker, Meteorologist, NOAA/OAR/PSD
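For a concrete picture of the 'vanilla' HDF5 route, here is a minimal h5py sketch of the layout described in the question: one (time, channel, sample) array plus the instrument settings stored as attributes. The dataset name `spectra`, the group name `settings`, the setting keys, and the helper functions are all assumptions made for illustration, and the array is shrunk to 6 s of synthetic data to keep the demo small.

```python
import os
import tempfile

import h5py
import numpy as np


def write_run(path, spectra, sample_rate_hz, settings):
    """Store one acquisition run: a (time, channel, sample) array plus settings."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("spectra", data=spectra, compression="gzip")
        dset.attrs["sample_rate_hz"] = sample_rate_hz
        # Instrument settings are kept as attributes on a dedicated group.
        grp = f.create_group("settings")
        for key, value in settings.items():
            grp.attrs[key] = value


def read_channel(path, channel):
    """Read back one channel; only the requested slice is returned."""
    with h5py.File(path, "r") as f:
        return f["spectra"][:, channel, :]


# Synthetic stand-in data: 6 s at 10 Hz, 6 channels, 2000 samples per spectrum.
rng = np.random.default_rng(0)
spectra = rng.random((60, 6, 2000)).astype(np.float32)

# Hypothetical instrument settings, purely illustrative.
settings = {"integration_time_ms": 50.0, "operator": "gary"}

path = os.path.join(tempfile.mkdtemp(), "run.h5")
write_run(path, spectra, sample_rate_hz=10.0, settings=settings)
channel_2 = read_channel(path, 2)
```

Slicing the dataset object (`f["spectra"][:, channel, :]`) uses the same indexing syntax as a NumPy array, which is the sense in which h5py maps directly onto NumPy.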
On Friday 30 January 2009, Jeff Whitaker wrote:
Gary Pajer wrote:
It's time for me to select a data format.
My data are (more or less) spectra (a couple of thousand samples), six channels, each channel running around 10 Hz, collecting for a minute or so. Plus all the settings on the instrument.
I don't see any significant differences between netCDF4 and HDF5.
Gary: netCDF4 is just a thin wrapper on top of HDF5 1.8 - think of it as a higher level API.
Similarly, I don't see significant differences between pytables and h5py. Does one play better with numpy?
pytables has been around longer and is well-tested, has nice pythonic features, but files you write with it may not be readable by C or fortran clients.
Just to be clear: PyTables will only write pickled objects to the file if it is not possible to reasonably represent them as native HDF5 objects. If you save NumPy objects or regular Python scalars, they are written as native HDF5 objects (see [1]).
Regarding a comparison with h5py (disclaimer: I'm the main author of PyTables), I'd say that h5py is designed to map directly onto NumPy array capabilities, but doesn't try to go further. It is also worth noting that h5py offers access to the low-level HDF5 functions, which can be great for users who want to get deeper into HDF5 intricacies.
PyTables, on the other hand, doesn't try to go this low-level. Besides supporting general NumPy objects, it is more focused on implementing advanced features that are normally only available in database-oriented approaches, like enumerated types, flexible query iterators for tables (the on-disk equivalent of recarrays), indexing (only in the Pro version), do/undo features or natural naming (for an enhanced interactive experience). PyTables also tries hard to be a high-performance interface to HDF5, implementing niceties like internal LRU caches for nodes, automatic chunksizes for the datasets, and using numexpr internally to accelerate queries as much as possible.
Finally, although h5py is relatively recent, I'm really impressed by the work that Andrew has already done, and in fact I'm looking forward to backporting some of the h5py features (like general NumPy-like fancy indexing capabilities) to PyTables. At any rate, it is clear that the h5py/PyTables pairing will benefit users, with the only handicap that they have to choose their preferred API to HDF5 (or they can use both, which could be really a lot of fun ;-).
NetCDF4-based interfaces are also probably a good approach and, as netCDF4 is based on HDF5, compatibility is ensured.
HTH,
[1] http://www.pytables.org/docs/manual/ch04.html#id2553542
-- Francesc Alted
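To make the "query iterators for tables" point concrete, here is a small sketch of a PyTables table queried with a condition string. It assumes the current snake_case PyTables API (`open_file`, `create_table`, `read_where`), which postdates this thread; the `Reading` row layout, the node name `readings`, and the helper functions are hypothetical, chosen only to resemble per-channel summary values from the instrument.

```python
import os
import tempfile

import tables


class Reading(tables.IsDescription):
    """One row per channel reading: time, channel number, and a peak value."""
    time_s = tables.Float64Col(pos=0)
    channel = tables.Int32Col(pos=1)
    peak = tables.Float32Col(pos=2)


def write_readings(path, records):
    """Append (time_s, channel, peak) tuples to a new on-disk table."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "readings", Reading)
        row = table.row
        for time_s, channel, peak in records:
            row["time_s"] = time_s
            row["channel"] = channel
            row["peak"] = peak
            row.append()
        table.flush()


def strong_peaks(path, channel, threshold):
    """Return rows for one channel whose peak exceeds the threshold."""
    with tables.open_file(path, mode="r") as h5:
        # Natural naming: the table is reached as an attribute of the root group.
        table = h5.root.readings
        # The condition string is evaluated by PyTables rather than row by row
        # in Python; the result is a NumPy structured array.
        return table.read_where(
            "(channel == ch) & (peak > thr)",
            condvars={"ch": channel, "thr": threshold},
        )


# Deterministic stand-in data: 5 time steps at 10 Hz, 6 channels.
records = [
    (t * 0.1, c, float((t * 7 + c * 3) % 10))
    for t in range(5)
    for c in range(6)
]

path = os.path.join(tempfile.mkdtemp(), "readings.h5")
write_readings(path, records)
result = strong_peaks(path, channel=2, threshold=5.0)
```

The returned array supports field access such as `result["peak"]`, so it behaves like the in-memory recarray that the table mirrors on disk.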
participants (3)
- Francesc Alted
- Gary Pajer
- Jeff Whitaker