[Numpy-discussion] loading data
Mag Gam
magawake at gmail.com
Fri Jun 26 06:38:11 EDT 2009
Thanks, everyone, for the great and well-thought-out responses!
To make matters worse, this is actually a 50 GB compressed CSV file, so
it looks like this: 2009.06.01.plasmasub.csv.gz
We get this data from another lab on the West Coast every night,
so I don't have the option of receiving the file natively in HDF5.
We are sticking with HDF5 because we have other applications that use
this data, and we wanted to standardize on HDF5.
Since my file is in CSV, would it be better for me to create a TSV file
temporarily and use np.loadtxt() on that?
Also, I am curious about Neil's np.memmap suggestion. Do you have some
sample code for mapping a compressed CSV file into memory and loading
the data into a dset (HDF5 dataset)?
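Roughly, I imagine something like the following (untested; the column
count, file names, and chunk size are made up, and I read the gzip
stream in chunks since I don't think a compressed file can be
memory-mapped directly):

import gzip
import itertools
import numpy as np
import h5py

fname = "2009.06.01.plasmasub.csv.gz"
ncols = 4              # made up; our real files have more columns
chunk_lines = 100000   # lines per chunk, tuned to available memory

f = h5py.File("plasmasub.h5", "w")
# resizable dataset so chunks can be appended as they are parsed
dset = f.create_dataset("data", shape=(0, ncols),
                        maxshape=(None, ncols), dtype="f8")

gz = gzip.open(fname, "rt")
while True:
    lines = list(itertools.islice(gz, chunk_lines))
    if not lines:
        break
    # np.loadtxt accepts a list of lines, so no temporary TSV is needed
    chunk = np.loadtxt(lines, delimiter=",").reshape(-1, ncols)
    n = dset.shape[0]
    dset.resize(n + chunk.shape[0], axis=0)
    dset[n:] = chunk
gz.close()
f.close()

Is that roughly the right idea?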
TIA
On Thu, Jun 25, 2009 at 9:50 PM, Anne Archibald <peridot.faceted at gmail.com> wrote:
> 2009/6/25 Mag Gam <magawake at gmail.com>:
>> Hello.
>>
>> I am very new to NumPy and Python. We are doing some research in our
>> physics lab and we need to store massive amounts of data (100 GB
>> daily). I am therefore going to use HDF5 and h5py. The problem is that
>> I am using np.loadtxt() to create my array and then create a dataset
>> from it, and np.loadtxt() is reading a file which is about 50 GB.
>> This takes a very long time! I was wondering if there is an easier
>> and better way of doing this.
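>>
>> Roughly what I do now (simplified, with made-up file and dataset
>> names):
>>
>> import numpy as np
>> import h5py
>>
>> a = np.loadtxt("data.txt")   # ~50 GB text file; this is the slow step
>> f = h5py.File("data.h5", "w")
>> f.create_dataset("data", data=a)
>> f.close()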
>
> If you are stuck with the text file, you probably can't beat
> numpy.loadtxt(); reading a 50 GB text file is going to be slow no
> matter how you cut it. So I would take a look at the code that
> generates the text file, and see if there's any way you can make it
> generate a format that is faster to read. (I assume the code is in C
> or FORTRAN and you'd rather not mess with it more than necessary.)
>
> Of course, generating hdf5 directly is probably fastest; you might
> look at the C and FORTRAN hdf5 libraries and see how hard it would be
> to integrate them into the code that currently generates a text file.
> Even if you need to have a python script to gather the data and add
> metadata, hdf5 will be much much more efficient than text files as an
> intermediate format.
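>
> As a rough, untested sketch of such a gathering script (file,
> dataset, and attribute names all made up, assuming h5py):
>
> import numpy as np
> import h5py
>
> # raw doubles written directly by the generating code
> data = np.fromfile("plasmasub.raw", dtype=np.float64)
>
> f = h5py.File("plasmasub.h5", "w")
> dset = f.create_dataset("plasmasub", data=data)
> # metadata travels with the array as HDF5 attributes
> dset.attrs["date"] = "2009-06-01"
> dset.attrs["source_lab"] = "west coast"
> f.close()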
>
> If integrating HDF5 into the generating application is too difficult,
> you can try simply generating a binary format. Using numpy's
> structured data types, it is possible to read in binary files
> extremely efficiently. If you're using the same architecture to
> generate the files as read them, you can just write out raw binary
> arrays of floats or doubles and then read them into numpy. I think
> FORTRAN also has a semi-standard padded binary format which isn't too
> difficult to read either. You could even use numpy's native file
> format, which for a single array should be pretty straightforward, and
> should yield portable results.
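>
> For concreteness, an untested sketch with a made-up record layout:
>
> import numpy as np
>
> # one record per sample; field names and types are invented here
> rec = np.dtype([("t", np.float64), ("x", np.float64),
>                 ("y", np.float64), ("counts", np.int32)])
> data = np.fromfile("2009.06.01.plasmasub.bin", dtype=rec)
>
> # or round-trip through numpy's own portable .npy format
> np.save("plasmasub.npy", data)
> data = np.load("plasmasub.npy")
>
> Reading this way is a single bulk read with no text parsing, which is
> about as fast as I/O gets.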
>
> If you really can't modify the code that generates the text files,
> your code is going to be slow. But you might be able to make it
> slightly less slow. If, for example, the text files are a very
> specific format, especially if they're made up of columns of fixed
> width, it would be possible to write compiled code to read them
> slightly more quickly. (The very easiest way to do this is to write a
> little C program that reads the text files and writes out a slightly
> friendlier format, as above.) But you may well find that simply
> reading a 50 GB file dominates your run time, which would mean that
> you're stuck with slowness.
>
>
> In short: avoid text files if at all possible.
>
>
> Good luck,
> Anne
>
>> TIA