[Numpy-discussion] loading data

Anne Archibald peridot.faceted at gmail.com
Thu Jun 25 21:50:28 EDT 2009

2009/6/25 Mag Gam <magawake at gmail.com>:
> Hello.
> I am very new to NumPy and Python. We are doing some research in our
> Physics lab and we need to store massive amounts of data (100GB
> daily). I am therefore going to use hdf5 and h5py. The problem is I
> am using np.loadtxt() to create my array and create a dataset
> according to that. np.loadtxt() is reading a file which is about 50GB.
> This takes a very long time! I was wondering if there was a much
> easier and better way of doing this.

If you are stuck with the text file, you probably can't beat
numpy.loadtxt(); reading a 50 GB text file is going to be slow no
matter how you cut it. So I would take a look at the code that
generates the text file, and see if there's any way you can make it
generate a format that is faster to read. (I assume the code is in C
or FORTRAN and you'd rather not mess with it more than necessary.)

Of course, generating hdf5 directly is probably fastest; you might
look at the C and FORTRAN hdf5 libraries and see how hard it would be
to integrate them into the code that currently generates a text file.
Even if you need to have a Python script to gather the data and add
metadata, hdf5 will be much more efficient than text files as an
intermediate format.

If integrating HDF5 into the generating application is too difficult,
you can try simply generating a binary format. Using numpy's
structured data types, it is possible to read in binary files
extremely efficiently. If you're using the same architecture to
generate the files as to read them, you can just write out raw binary
arrays of floats or doubles and then read them into numpy (for
example with numpy.fromfile()). I think FORTRAN's unformatted
sequential output is also a semi-standard binary format, with each
record framed by length markers, which isn't too difficult to read
either. You could even use numpy's native .npy file format, which for
a single array should be pretty straightforward, and should yield
portable results.
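
A minimal sketch of the binary options above, in case it helps. The
file names, the three-column layout and the record fields are made up
for illustration; the real layout is whatever the generating program
writes.

```python
import os
import tempfile

import numpy as np

# Hypothetical data standing in for the generator's output:
# 1000 rows of three float64 columns.
data = np.arange(3000, dtype=np.float64).reshape(1000, 3)

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "run.bin")      # hypothetical file name
npy_path = os.path.join(tmpdir, "run.npy")      # hypothetical file name
rec_path = os.path.join(tmpdir, "records.bin")  # hypothetical file name

# Option 1: raw binary. The file holds only the bytes of the doubles,
# so the reader must already know the dtype, byte order and shape.
data.tofile(raw_path)
raw = np.fromfile(raw_path, dtype=np.float64).reshape(-1, 3)

# Option 2: numpy's .npy format. Its header records dtype, byte order
# and shape, so the file describes itself.
np.save(npy_path, data)
loaded = np.load(npy_path)

# For arrays too large for memory, map the file instead of reading it.
mapped = np.load(npy_path, mmap_mode="r")
first_rows = np.array(mapped[:10])

# Option 3: a structured dtype for fixed-size records with mixed
# fields, e.g. what a C program writes with fwrite() on a packed
# struct. The field names here are assumptions.
record = np.dtype([("event", "<i4"), ("energy", "<f8"), ("time", "<f8")])
records = np.zeros(5, dtype=record)
records["event"] = np.arange(5)
records["energy"] = 1.5
records.tofile(rec_path)
records_back = np.fromfile(rec_path, dtype=record)
```

The raw and structured variants are only portable if the reader
spells out the byte order (the "<" in the dtype strings means
little-endian); the .npy file carries that information itself.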

If you really can't modify the code that generates the text files,
your code is going to be slow. But you might be able to make it
slightly less slow. If, for example, the text files are in a very
specific format, especially if they're made up of columns of fixed
width, it would be possible to write compiled code to read them
slightly more quickly. (The very easiest way to do this is to write a
little C program that reads the text files and writes out a slightly
friendlier format, as above.) But you may well find that simply
reading a 50 GB file dominates your run time, which would mean that
you're stuck with slowness.
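
The one-time conversion idea can be sketched as follows. This is
Python rather than the compiled C program suggested above, so it will
not be as fast, but it shows the shape of the approach: read the text
in bounded chunks, so the whole 50 GB never has to sit in memory, and
append each chunk to a raw float64 file. The function name
text_to_binary, its parameters and the chunk size are assumptions for
illustration, and the input is assumed to be plain whitespace-separated
numeric rows with no header or comment lines.

```python
import itertools
import os
import tempfile

import numpy as np


def text_to_binary(text_path, bin_path, ncols, rows_per_chunk=100_000):
    """Convert a whitespace-separated text table to raw float64 in chunks.

    Hypothetical helper; returns the number of rows written.
    """
    total_rows = 0
    with open(text_path) as src, open(bin_path, "wb") as dst:
        while True:
            # Pull at most rows_per_chunk lines; an empty list means EOF.
            chunk = list(itertools.islice(src, rows_per_chunk))
            if not chunk:
                break
            # Skip blank lines so loadtxt never sees an empty chunk.
            lines = [line for line in chunk if line.strip()]
            if not lines:
                continue
            block = np.loadtxt(lines, dtype=np.float64, ndmin=2)
            if block.shape[1] != ncols:
                raise ValueError(
                    "expected %d columns, got %d" % (ncols, block.shape[1])
                )
            # Append this chunk's bytes to the output file.
            block.tofile(dst)
            total_rows += block.shape[0]
    return total_rows


# Small demonstration with made-up data.
tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "sample.txt")  # hypothetical file name
bin_path = os.path.join(tmpdir, "sample.bin")   # hypothetical file name
sample = np.arange(30, dtype=np.float64).reshape(10, 3) * 0.5
np.savetxt(text_path, sample)

n_rows = text_to_binary(text_path, bin_path, ncols=3, rows_per_chunk=4)
converted = np.fromfile(bin_path, dtype=np.float64).reshape(n_rows, 3)
```

After the conversion, every later run reads the binary file with
numpy.fromfile() (or maps it with numpy.memmap) and skips the text
parsing entirely.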

In short: avoid text files if at all possible.

Good luck,
