[Numpy-discussion] loading data
Mag Gam
magawake at gmail.com
Fri Jun 26 07:09:13 EDT 2009
I really like the slice-by-slice idea!
But I don't know how to implement it. Do you have any sample code?
I suspect it's the writing portion that's taking the longest. I did a
simple decompression test and it's fast.
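
For reference, here is a minimal sketch of the slice-by-slice approach,
assuming h5py as the HDF5 writer and a plain comma-separated file of
floats; the column count, chunk size, and dataset name below are invented
for illustration:

import gzip
import itertools

import h5py
import numpy as np

CHUNK_ROWS = 100_000   # rows per slice; tune to available memory
N_COLS = 8             # hypothetical column count for this CSV

with gzip.open("2009.06.01.plasmasub.csv.gz", "rt") as f, \
        h5py.File("plasmasub.h5", "w") as h5:
    dset = h5.create_dataset("plasmasub", shape=(0, N_COLS),
                             maxshape=(None, N_COLS), dtype="f8",
                             chunks=(4096, N_COLS))
    while True:
        # Pull one slice of text lines off the decompressing stream.
        lines = list(itertools.islice(f, CHUNK_ROWS))
        if not lines:
            break
        # Parse just this slice, then append it with one HDF5 slice write.
        block = np.atleast_2d(np.loadtxt(lines, delimiter=","))
        dset.resize(dset.shape[0] + block.shape[0], axis=0)
        dset[-block.shape[0]:] = block

This never holds more than CHUNK_ROWS parsed rows in memory at a time, and
each write touches a contiguous slice of the dataset rather than a single
record.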
On Fri, Jun 26, 2009 at 7:05 AM, Francesc Alted <faltet at pytables.org> wrote:
> On Friday 26 June 2009 12:38:11, Mag Gam wrote:
>> Thanks everyone for the great and well thought out responses!
>>
>> To make matters worse, this is actually a 50 GB compressed CSV file,
>> named like 2009.06.01.plasmasub.csv.gz.
>> We get this data from another lab on the West Coast every night, so I
>> don't have the option of receiving the file natively in HDF5. We are
>> sticking with HDF5 because we have other applications that use this
>> data and we wanted to standardize on HDF5.
>
> Well, since you are adopting HDF5, the best solution would be for the West
> Coast lab to send the file directly in HDF5. That would save you a lot of
> headaches. If that is not possible, then I think the best approach is to
> profile your code and see where the bottleneck is. cProfile normally offers
> good insight into what is consuming the most time in your converter (a
> minimal cProfile sketch appears after this quoted message).
>
> There are three likely hot spots: the decompressor (gzip), np.loadtxt, and
> the HDF5 write. If the problem is gzip, then you won't be able to accelerate
> the conversion unless the other lab is willing to use a lighter compressor
> (lzop, for example). If it is np.loadtxt(), ask yourself whether you are
> trying to load everything into memory; if you are, don't do that; just load
> & write slice by slice. Finally, if the problem is the HDF5 write, write
> array slices rather than record by record.
>
>> Also, I am curious about Neil's np.memmap suggestion. Do you have some
>> sample code for mapping a compressed CSV file into memory and loading
>> the data into a dset (HDF5 structure)?
>
> No, np.memmap is meant to map *uncompressed binary* files into memory, so
> you can't follow this path (a short memmap sketch follows below).
>
> --
> Francesc Alted
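
Following up on the cProfile suggestion above, a minimal profiling sketch;
convert() here is a hypothetical stand-in for whatever function drives the
csv.gz-to-HDF5 pass:

import cProfile
import pstats

# Run the converter under the profiler and save the raw timings.
cProfile.run("convert('2009.06.01.plasmasub.csv.gz')", "convert.prof")

# Show the ten entries with the largest cumulative time; if gzip,
# loadtxt, or the HDF5 writes dominate, they will show up near the top.
pstats.Stats("convert.prof").sort_stats("cumulative").print_stats(10)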
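
And to illustrate what np.memmap is actually for, a tiny sketch on an
uncompressed binary file (the file name and shape are invented): it maps
bytes on disk directly onto array elements, which is exactly why a gzip
stream cannot sit in between.

import numpy as np

# Works only on uncompressed, fixed-layout binary data: each array
# element corresponds to a fixed byte offset in the file on disk.
raw = np.memmap("plasmasub.dat", dtype="f8", mode="r",
                shape=(1_000_000, 8))
col0_mean = raw[:, 0].mean()   # pages data in lazily as it is touched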