On Monday 20 September 2004 15:16, Timo Korvola wrote:
... which appears to be actually an HDF5 file. Thanks for the tip. It is clear that a binary file format would be more advantageous, simply because text files are not seekable in the way needed for parallel reading.
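To illustrate the seekability point: with fixed-size binary records, the byte offset of any row can be computed directly, so each reader can jump straight to its own slice, whereas a text file has to be scanned line by line. Below is a minimal sketch using only the Python standard library. The record layout (three little-endian 32-bit integers per row) and the helper names `write_rows`, `read_rows` and `demo` are assumptions made for illustration; this is not the HDF5 or NetCDF on-disk format.

```python
import os
import struct
import tempfile

# Hypothetical record layout: one row = three little-endian 32-bit ints.
ROW = struct.Struct("<3i")


def write_rows(path, rows):
    """Write each row as one fixed-size binary record."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(ROW.pack(*row))


def read_rows(path, start, count):
    """Read `count` rows beginning at row `start`, skipping earlier rows."""
    with open(path, "rb") as f:
        # Fixed-size records make the offset a simple multiplication.
        f.seek(start * ROW.size)
        data = f.read(count * ROW.size)
    n_read = len(data) // ROW.size
    return [ROW.unpack_from(data, i * ROW.size) for i in range(n_read)]


def demo():
    """Write 1000 rows to a temporary file and read two from the middle."""
    rows = [(i, 2 * i, 3 * i) for i in range(1000)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_rows(path, rows)
        return read_rows(path, 500, 2)
    finally:
        os.remove(path)
```

In a parallel setting, each process would call something like `read_rows` with its own `start` and `count`, which is the access pattern a line-oriented text file cannot offer without scanning from the beginning.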
Well, if you are considering parallel reading for the sake of speed, try PyTables first; you may be surprised how fast it can be. For example, using the same example that Todd sent today (i.e. writing and reading an array of (10**5,3) integer elements), I've re-run it using PyTables and, just for the sake of comparison, NetCDF (using the Scientific Python wrapper). Here are the results (on a laptop with a Pentium IV @ 2 GHz running Debian GNU/Linux):

Time to write file (text mode)               2.12    sec
Time to write file (NetCDF version)          0.0587  sec
Time to write file (PyTables version)        0.00682 sec

Time to read file (strings.fasteval version) 0.259   sec
Time to read file (NetCDF version)           0.0470  sec
Time to read file (PyTables version)         0.00423 sec

So, for reading, PyTables can be more than 60 times faster than numarray.strings.eval and more than 10 times faster than Scientific.IO.NetCDF (the latter using Numeric). And I'm pretty sure that these ratios would increase for bigger datasets.
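For anyone who wants to repeat this kind of measurement, here is a minimal timing sketch for the same array shape, (10**5, 3) integers. It is an assumption-laden stand-in: it uses only the Python standard library and compares plain text against a raw binary dump made with the `array` module, not PyTables, NetCDF or numarray, so its numbers are not comparable to the ones above. All function names (`write_text`, `read_binary`, `run_benchmark`, and so on) are hypothetical.

```python
import os
import tempfile
import time
from array import array

N_ROWS, N_COLS = 10**5, 3


def make_data():
    # Flat, row-major storage of an (N_ROWS, N_COLS) integer array.
    return array("i", range(N_ROWS * N_COLS))


def write_text(path, data):
    # One row per line, values separated by spaces.
    with open(path, "w") as f:
        for i in range(0, len(data), N_COLS):
            f.write(" ".join(map(str, data[i:i + N_COLS])) + "\n")


def read_text(path):
    out = array("i")
    with open(path) as f:
        for line in f:
            out.extend(int(tok) for tok in line.split())
    return out


def write_binary(path, data):
    # Raw dump of the machine representation; not portable across platforms.
    with open(path, "wb") as f:
        data.tofile(f)


def read_binary(path, n_items):
    out = array("i")
    with open(path, "rb") as f:
        out.fromfile(f, n_items)
    return out


def timed(func, *args):
    # Return the function result together with the elapsed wall-clock time.
    t0 = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - t0


def run_benchmark():
    data = make_data()
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        txt = os.path.join(tmp, "data.txt")
        raw = os.path.join(tmp, "data.bin")
        _, timings["write text"] = timed(write_text, txt, data)
        _, timings["write binary"] = timed(write_binary, raw, data)
        text_back, timings["read text"] = timed(read_text, txt)
        bin_back, timings["read binary"] = timed(read_binary, raw, len(data))
    # Both formats must round-trip the data exactly.
    assert text_back == data and bin_back == data
    return timings


if __name__ == "__main__":
    for label, seconds in run_benchmark().items():
        print(f"Time to {label}: {seconds:.4f} sec")
```

Timings vary with hardware, filesystem caching and Python version, so it is worth running it several times before drawing conclusions.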
I was thinking of using NetCDF because OpenDX does not support HDF5.
Are you sure? Here are a couple of OpenDX data importers for HDF5:

http://www.cactuscode.org/VizTools/OpenDX.html
http://www-beams.colorado.edu/dxhdf5/
An advantage of HDF5 would be that the libraries support parallel I/O via MPI-IO, but can this be utilised in PyTables? There is the problem that there are no standard MPI bindings for Python.
Curiously enough, Paul Dubois asked me the very same question during the recent SciPy '04 Conference. And the answer is the same: PyTables does not support MPI-IO at this time, because I guess that it could consume a formidable amount of developer time. I think I should first try to make PyTables threading-aware before embarking on larger enterprises. I recognize, though, that an MPI-IO-aware PyTables would be quite nice.
I have also considered writing Python bindings for Parallel-NetCDF but I suppose that would not be totally trivial even if the library turns out to be well Swiggable.
Before doing that, talk with Konrad. I know that Scientific Python supports MPI and BSPlib right out of the box, so maybe there is a shorter path to do what you want. In addition, you should be aware that the next version of NetCDF (version 4) will be implemented on top of HDF5 [1]. So, perhaps spending your time writing Python bindings for Parallel-HDF5 would be a better bet for future applications.

[1] http://my.unidata.ucar.edu/content/software/netcdf/netcdf-4/index.html

Cheers,

-- 
Francesc Alted