[Numpy-discussion] Designing a new storage format for numpy recarrays

Fri Oct 30 09:48:54 EDT 2009

Dag Sverre Seljebotn:
> Hi,
>
> Is anyone working on alternative storage options for numpy arrays, and
> specifically recarrays? My main application involves processing series
> of large recarrays (say 1000 recarrays, each with 5M rows having 50
> fields). Existing options meet some but not all of my requirements.
>
> Requirements
> --------------
> The basic requirements are:
>
> Mandatory
>  - fast
>  - suitable for very large arrays (larger than can fit in memory)
>  - compressed (to reduce disk space, read data more quickly)
>  - seekable (can read subset of data without decompressing everything)
>  - can append new data to an existing file
>  - able to extract individual fields from a recarray (for when indexing
> or processing needs just a few fields)
> Nice to have
>  - files can be split without decompressing and recompressing (e.g.
> distribute processing over a grid)
>  - encryption, ideally field-level, with encryption occurring after
> compression
>  - can store multiple arrays in one physical file (convenience)
>  - portable/standard/well documented
>
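
(To make the compressed / seekable / appendable / per-field combination
concrete, here is a rough sketch of one possible layout, not an existing
format. Each field is compressed in independent chunks and a small index
records where each chunk lives. The class name FieldStore, the tiny chunk
size, the in-memory buffer standing in for a file, and zlib standing in
for a faster codec are all assumptions made for illustration.)

```python
import io
import struct
import zlib

CHUNK_ROWS = 4  # tiny on purpose; a real layout would use far larger chunks


class FieldStore:
    """Hypothetical append-only store with per-field compressed chunks."""

    def __init__(self, fields):
        # fields maps a field name to a struct format char, e.g. {"id": "q"}
        self.fields = dict(fields)
        self.buf = io.BytesIO()  # stands in for a file on disk
        # field name -> list of (offset, compressed size, row count)
        self.index = {name: [] for name in self.fields}

    def append(self, columns):
        """Append rows given as a dict of equal-length value lists."""
        nrows = len(next(iter(columns.values())))
        for start in range(0, nrows, CHUNK_ROWS):
            for name, fmt in self.fields.items():
                part = columns[name][start:start + CHUNK_ROWS]
                raw = struct.pack("<%d%s" % (len(part), fmt), *part)
                blob = zlib.compress(raw, 1)  # low level: favour speed
                offset = self.buf.seek(0, io.SEEK_END)
                self.buf.write(blob)
                self.index[name].append((offset, len(blob), len(part)))

    def read_field(self, name, first_chunk=0, last_chunk=None):
        """Decompress only the requested chunks of one field."""
        fmt = self.fields[name]
        out = []
        for offset, nbytes, n in self.index[name][first_chunk:last_chunk]:
            self.buf.seek(offset)
            raw = zlib.decompress(self.buf.read(nbytes))
            out.extend(struct.unpack("<%d%s" % (n, fmt), raw))
        return out


store = FieldStore({"id": "q", "value": "d"})
store.append({"id": list(range(10)), "value": [i * 0.5 for i in range(10)]})
store.append({"id": [10, 11], "value": [5.0, 5.5]})  # append to existing data

second_chunk = store.read_field("value", 1, 2)  # one chunk of one field
all_ids = store.read_field("id")                # one field, every chunk
```

(Because every chunk is compressed on its own, a reader can seek straight
to it through the index, and a file could in principle be split at chunk
boundaries without recompressing.)
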
> Existing options
> -----------------
> Over the last few years I've tried most of numpy's options for saving
> arrays to disk, including pickles, .npy, .npz, memmap-ed files and HDF
> (PyTables).
>
> None of these is perfect, although PyTables comes close:
>  - .npy - not compressed, need to read whole array into memory
>  - .npz - compressed but ZLIB compression is too slow
>  - memmap - not compressed
>  - PyTables (HDF using chunked storage for recarrays with LZO
> compression and shuffle filter)
>     - can't extract individual field from a recarray

I'm just learning PyTables so I'm curious about this... if I use a normal
Table, it will be presented as a NumPy record array when I access it, and
I can access individual fields. What are the disadvantages to that?
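
For concreteness, this is roughly the access pattern I mean. It is only a
minimal sketch, assuming a recent PyTables (the open_file/create_table
spelling) and a made-up two-field table; the file name, the field names
and the zlib filter are placeholders, not your setup.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical two-field record layout, for illustration only.
dtype = np.dtype([("id", "i8"), ("value", "f8")])
rows = np.zeros(1000, dtype=dtype)
rows["id"] = np.arange(1000)
rows["value"] = np.arange(1000) * 0.5

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with tables.open_file(path, mode="w") as h5:
    # zlib is used only because it ships with HDF5; LZO needs an extra library.
    filters = tables.Filters(complevel=1, complib="zlib", shuffle=True)
    table = h5.create_table("/", "records", description=dtype, filters=filters)
    table.append(rows)

with tables.open_file(path, mode="r") as h5:
    table = h5.root.records
    whole = table.read(0, 10)            # structured array with both fields
    values = table.read(field="value")   # only the "value" field is returned
    ids = table.col("id")                # same idea, via Table.col
```

As far as I understand, a Table keeps the fields of each row together
inside a chunk, so reading one field this way may still have to decompress
whole chunks. Is that the cost you are referring to?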

>     - multiple dependencies (HDF, PyTables+LZO, h5py+LZF)

(I think this is a pro, not a con: It means that there's a lot of already
bugfixed code being used. Any codebase is only as strong as the number of
eyes on it.)

Dag Sverre