[Numpy-discussion] [Fwd: compression in storage of Numeric/numarray objects]

Francesc Altet faltet at carabos.com
Mon Sep 12 06:02:34 EDT 2005


El dv 09 de 09 del 2005 a les 22:41 +0200, en/na Joost van Evert va
escriure:
> On Fri, 2005-09-09 at 15:06 -0500, John Hunter wrote:
> Thanks, this helps me, but I think not enough, because the arrays I work
> on are sometimes >1 GB (correlation matrices). The tostring method would
> explode the size and result in a lot of swapping. Ideally, the
> compression would also work with memory-mapped arrays.

[mode advertising on, be warned <wink>]

You may want to use PyTables [1]. It supports on-the-fly data compression
and access to on-disk data in a way similar to memory-mapped arrays.

Example of use (this assumes "import tables" and "import
numarray.random_array" have been done in the session beforehand):

In [66]:f=tables.openFile("/tmp/test-zlib.h5","w")

In [67]:fzlib=tables.Filters(complevel=1, complib="zlib") # the filter

In [68]:chunk=tables.Float64Atom(shape=(50,50))  # the data 'chunk'

In [69]:carr=f.createCArray(f.root, "carr",(1000, 1000),chunk,'',fzlib)

In [70]:carr[:]=numarray.random_array.random((1000,1000))

In [71]:f.close()

In [72]:ls -l /tmp/test-zlib.h5
-rw-r--r--  1 faltet users 3680721 2005-09-12 14:27 /tmp/test-zlib.h5
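
For reference, the uncompressed file used in the timings below can be
created the same way, just with no compression filter. A minimal sketch,
assuming the same PyTables/numarray API as the session above:

import tables
import numarray.random_array

f = tables.openFile("/tmp/test-nocompr.h5", "w")
chunk = tables.Float64Atom(shape=(50, 50))      # same 50x50 chunks
# no Filters argument, so no compression is applied
carr = f.createCArray(f.root, "carr", (1000, 1000), chunk)
carr[:] = numarray.random_array.random((1000, 1000))
f.close()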

Note that the compressed file takes about 3.7 MB, versus 8 MB of raw data
(1000x1000 float64). Now you can access the data on disk as if it were
in memory:
In [73]:f=tables.openFile("/tmp/test-zlib.h5","r")

In [74]:f.root.carr[300,200]
Out[74]:0.76497000455856323

In [75]:f.root.carr[300:310:3,900:910:2]
Out[75]:
array([[ 0.5336495 ,  0.55542123,  0.80049258,  0.84423071,  0.47674203],
       [ 0.93104523,  0.71216697,  0.23955345,  0.89759707,  0.70620197],
       [ 0.86999339,  0.05541291,  0.55156851,  0.96808773,  0.51768076],
       [ 0.29315394,  0.03837755,  0.33675179,  0.93591529,  0.99721605]])

Also, access to data on disk is very fast, even when it is compressed:

In [77]:tzlib=timeit.Timer("carr[300:310:3,900:910:2]", "import tables; f=tables.openFile('/tmp/test-zlib.h5'); carr=f.root.carr")

In [78]:tzlib.repeat(3,100)
Out[78]:[0.204339981079101, 0.176630973815917, 0.177133798599243]

Compare these times with non-compressed data:

In [80]:tnc=timeit.Timer("carr[300:310:3,900:910:2]", "import tables; f=tables.openFile('/tmp/test-nocompr.h5'); carr=f.root.carr")

In [81]:tnc.repeat(3,100)
Out[81]:[0.089105129241943, 0.084129095077514, 0.084383964538574219]
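
The same measurement as a standalone script, for clarity (a sketch; each
repeat covers 100 reads, so dividing by 100 gives the per-read cost):

import timeit

setup = ("import tables; "
         "f = tables.openFile('/tmp/test-zlib.h5'); "
         "carr = f.root.carr")
t = timeit.Timer("carr[300:310:3,900:910:2]", setup)
# per-read time in seconds for each of 3 repeats of 100 reads
print([r / 100.0 for r in t.repeat(3, 100)])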

This means that PyTables can access data in the middle of a dataset
without decompressing the whole thing; only the interesting chunks are
read (and you can choose the chunk size, as the sketch below shows). The
access times are in the range of milliseconds per read (0.18 s for 100
reads is about 1.8 ms each), whether or not the data is compressed.
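
For instance, here is a sketch of how the chunk size is chosen: the Atom
shape passed to createCArray is the chunk shape (the file name is just
illustrative):

import tables
import numarray.random_array

f = tables.openFile("/tmp/test-bigchunk.h5", "w")
fzlib = tables.Filters(complevel=1, complib="zlib")
# 100x100 chunks instead of 50x50: each random access decompresses a
# bigger chunk, but sequential reads need fewer of them
bigchunk = tables.Float64Atom(shape=(100, 100))
carr = f.createCArray(f.root, "carr", (1000, 1000), bigchunk, '', fzlib)
carr[:] = numarray.random_array.random((1000, 1000))
f.close()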

PyTables also supports other compressors besides zlib, such as bzip2 [2]
and LZO [3], as well as compression pre-conditioners like shuffle [4].
Look at the compression ratios, even for completely random data:

In [84]:ls -l /tmp/test*.h5
-rw-r--r--  1 faltet users 3675874 /tmp/test-bzip2-shuffle.h5
-rw-r--r--  1 faltet users 3680615 /tmp/test-zlib-shuffle.h5
-rw-r--r--  1 faltet users 3777749 /tmp/test-lzo-shuffle.h5
-rw-r--r--  1 faltet users 8025024 /tmp/test-nocompr.h5
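
These files can be produced with Filters combinations along these lines
(a sketch; shuffle is a flag of the Filters constructor, and bzip2/LZO
support depends on how PyTables was built):

import tables

# zlib + shuffle, bzip2 + shuffle and LZO + shuffle, respectively
fzlib_sh  = tables.Filters(complevel=1, complib="zlib",  shuffle=True)
fbzip2_sh = tables.Filters(complevel=1, complib="bzip2", shuffle=True)
flzo_sh   = tables.Filters(complevel=1, complib="lzo",   shuffle=True)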

LZO is especially interesting if you want fast access to your data (it
is very fast at decompressing):

In [82]:tlzo=timeit.Timer("carr[300:310:3,900:910:2]", "import tables; f=tables.openFile('/tmp/test-lzo-shuffle.h5'); carr=f.root.carr")

In [83]:tlzo.repeat(3,100)
Out[83]:[0.12332820892333984, 0.11892890930175781, 0.12009191513061523]

So, retrieving compressed data with LZO is only about 40% slower than
using no compression at all (best timings: 0.119 s vs. 0.084 s per 100
reads). You can find more exhaustive benchmarks and discussion in [5].

[1] http://www.pytables.org
[2] http://www.bzip2.org
[3] http://www.oberhumer.com/opensource/lzo
[4] http://hdf.ncsa.uiuc.edu/HDF5/doc_resource/H5Shuffle_Perf.pdf
[5] http://pytables.sourceforge.net/html-doc/usersguide6.html#section6.3

Uh, sorry for the blurb, but benchmarking is a lot of fun.

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"






More information about the NumPy-Discussion mailing list