[Fwd: compression in storage of Numeric/numarray objects]

"Joost" == Joost van Evert <phjoost@gmail.com> writes:
Joost> is it possible to use compression while storing
Joost> numarray/Numeric objects?

Sure

In [35]: s = rand(10000)

In [36]: file('uncompressed.dat', 'wb').write(s.tostring())

In [37]: ls -l uncompressed.dat
-rw-r--r-- 1 jdhunter jdhunter 80000 2005-09-09 15:04 uncompressed.dat

In [38]: gzip.open('compressed.dat', 'wb').write(s.tostring())

In [39]: ls -l compressed.dat
-rw-r--r-- 1 jdhunter jdhunter 41393 2005-09-09 15:04 compressed.dat

The compression ratio for more regular data will be better.

JDH
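For the read side, which the session above leaves out, decompression is symmetric: a minimal sketch, assuming Numeric's fromstring and that rand() produced Float64 data:

    import gzip
    import Numeric

    # Decompress the whole byte string and rebuild the array.
    # Float64 is an assumption here: rand() returns doubles.
    data = gzip.open('compressed.dat', 'rb').read()
    s2 = Numeric.fromstring(data, Numeric.Float64)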

On Fri, 2005-09-09 at 15:06 -0500, John Hunter wrote:
"Joost" == Joost van Evert <phjoost@gmail.com> writes:
Joost> is it possible to use compression while storing Joost> numarray/Numeric objects?
Sure
In [35]: s = rand(10000)
In [36]: file('uncompressed.dat', 'wb').write(s.tostring())
In [37]: ls -l uncompressed.dat -rw-r--r-- 1 jdhunter jdhunter 80000 2005-09-09 15:04 uncompressed.dat
In [38]: gzip.open('compressed.dat', 'wb').write(s.tostring())
In [39]: ls -l compressed.dat -rw-r--r-- 1 jdhunter jdhunter 41393 2005-09-09 15:04 compressed.dat
Thanks, this helps me, but I think not enough, because the arrays I work on are sometimes >1 GB (correlation matrices). The tostring method would explode the memory footprint and result in a lot of swapping. Ideally the compression would also work with memory-mapped arrays.

Greets,
Joost
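One way to keep the write side from making the full-array copy that tostring() implies is to compress a band of rows at a time. A rough sketch, assuming a 2-D Numeric/numarray-style array; write_compressed is a made-up helper name and the chunk size is arbitrary:

    import gzip

    def write_compressed(a, filename, rows_per_chunk=1024):
        # Only rows_per_chunk rows are converted to a string at once,
        # so the extra memory is bounded regardless of the array size.
        f = gzip.open(filename, 'wb')
        for i in range(0, a.shape[0], rows_per_chunk):
            f.write(a[i:i + rows_per_chunk].tostring())
        f.close()

This bounds the write-side memory, but reading back still needs either the whole decompressed buffer or a chunked scheme like the one discussed below.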

On Sep 9, 2005, at 4:41 PM, Joost van Evert wrote:
> On Fri, 2005-09-09 at 15:06 -0500, John Hunter wrote:
> [...]
> Thanks, this helps me, but I think not enough, because the arrays I work
> on are sometimes >1 GB (correlation matrices). The tostring method would
> explode the memory footprint and result in a lot of swapping. Ideally
> the compression would also work with memory-mapped arrays.
Well, it seems to me that you are asking for quite a lot if you expect this to work with memory-mapped arrays that are compressed (I'm assuming you mean that individual values are decompressed on the fly as they are needed). This is something we gave some thought to a few years ago, but supporting such capabilities seemed far too complicated, at least for now. Besides, some operations are bound to blow up (e.g., take on a compressed array).

But I'm still not sure what you are trying to do and what you would like to see happen underneath. An example would do a lot to explain what your needs are.

Thanks,
Perry Greenfield
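To make the idea concrete: the scheme described here (and the one chunked formats such as HDF5 implement) amounts to compressing fixed-size chunks independently, so a read only decompresses the chunks it touches. A toy sketch in plain Python 2, matching the thread's string-based byte handling; ChunkedCompressedBuffer is an invented name for illustration, not an existing Numeric/numarray facility:

    import zlib

    class ChunkedCompressedBuffer:
        def __init__(self, data, chunksize=65536):
            # Compress each fixed-size chunk independently.
            self.chunksize = chunksize
            self.chunks = [zlib.compress(data[i:i + chunksize])
                           for i in range(0, len(data), chunksize)]

        def read(self, offset, length):
            # Return `length` raw bytes starting at `offset`,
            # decompressing only the chunks the request overlaps.
            pieces = []
            end = offset + length
            while offset < end:
                idx, rel = divmod(offset, self.chunksize)
                chunk = zlib.decompress(self.chunks[idx])
                take = min(end - offset, len(chunk) - rel)
                pieces.append(chunk[rel:rel + take])
                offset += take
            return ''.join(pieces)

A small slice read touches one or two chunks, but an operation like take with scattered indices has to decompress many chunks, which is why it is bound to blow up.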

You may be able to avoid the tostring() overhead by using tofile():

    s.tofile(gzip.open('compressed.dat', 'wb'))

You are probably SOL on the mmapping, though.

w

On Fri, 9 Sep 2005, Joost van Evert wrote:
> On Fri, 2005-09-09 at 15:06 -0500, John Hunter wrote:
> [...]
> Thanks, this helps me, but I think not enough, because the arrays I work
> on are sometimes >1 GB (correlation matrices). The tostring method would
> explode the memory footprint and result in a lot of swapping. Ideally
> the compression would also work with memory-mapped arrays.
>
> Greets,
> Joost

Joost> is it possible to use compression while storing
Joost> numarray/Numeric objects?

Try the gzip or bz2 modules. Both have file-like objects that transparently (de)compress data as it is read or written.

Joost> Ideally the compression also works with memory-mapped arrays.

Dunno, but probably not. You'll have to experiment.

Skip
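A minimal round-trip sketch of the bz2 route suggested here, assuming Numeric's RandomArray and fromstring:

    import bz2
    import Numeric, RandomArray

    s = RandomArray.random(10000)

    # BZ2File compresses transparently on write...
    f = bz2.BZ2File('compressed.bz2', 'w')
    f.write(s.tostring())
    f.close()

    # ...and decompresses transparently on read.
    f = bz2.BZ2File('compressed.bz2', 'r')
    s2 = Numeric.fromstring(f.read(), Numeric.Float64)
    f.close()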

On Fri, 09 Sep 2005 at 22:41 +0200, Joost van Evert wrote:
> On Fri, 2005-09-09 at 15:06 -0500, John Hunter wrote:
> [...]
>
> Thanks, this helps me, but I think not enough, because the arrays I work
> on are sometimes >1 GB (correlation matrices). The tostring method would
> explode the memory footprint and result in a lot of swapping. Ideally
> the compression would also work with memory-mapped arrays.
[mode advertising on, be warned <wink>]

You may want to use PyTables [1]. It supports on-line data compression and gives access to on-disk data in a way similar to memory-mapped arrays. Example of use:

In [66]:f=tables.openFile("/tmp/test-zlib.h5","w")
In [67]:fzlib=tables.Filters(complevel=1, complib="zlib")  # the filter
In [68]:chunk=tables.Float64Atom(shape=(50,50))  # the data 'chunk'
In [69]:carr=f.createCArray(f.root, "carr",(1000, 1000),chunk,'',fzlib)
In [70]:carr[:]=numarray.random_array.random((1000,1000))
In [71]:f.close()
In [72]:ls -l /tmp/test-zlib.h5
-rw-r--r-- 1 faltet users 3680721 2005-09-12 14:27 /tmp/test-zlib.h5

Now you can access the data on disk as if it were in memory:

In [73]:f=tables.openFile("/tmp/test-zlib.h5","r")
In [74]:f.root.carr[300,200]
Out[74]:0.76497000455856323
In [75]:f.root.carr[300:310:3,900:910:2]
Out[75]:
array([[ 0.5336495 ,  0.55542123,  0.80049258,  0.84423071,  0.47674203],
       [ 0.93104523,  0.71216697,  0.23955345,  0.89759707,  0.70620197],
       [ 0.86999339,  0.05541291,  0.55156851,  0.96808773,  0.51768076],
       [ 0.29315394,  0.03837755,  0.33675179,  0.93591529,  0.99721605]])

Also, access to disk is very fast, even if you compressed your data:

In [77]:tzlib=timeit.Timer("carr[300:310:3,900:910:2]","import tables;f=tables.openFile('/tmp/test-zlib.h5');carr=f.root.carr")
In [78]:tzlib.repeat(3,100)
Out[78]:[0.204339981079101, 0.176630973815917, 0.177133798599243]

Compare these times with those for non-compressed data:

In [80]:tnc=timeit.Timer("carr[300:310:3,900:910:2]","import tables;f=tables.openFile('/tmp/test-nocompr.h5');carr=f.root.carr")
In [81]:tnc.repeat(3,100)
Out[81]:[0.089105129241943, 0.084129095077514, 0.084383964538574219]

That means that PyTables can access data in the middle of a dataset without decompressing the whole thing, just the interesting chunks (and you can decide the size of these chunks). You can see that the access times are in the range of milliseconds, regardless of whether the data is compressed.

PyTables also supports other compressors apart from zlib, such as bzip2 [2] and LZO [3], as well as compression pre-conditioners, like shuffle [4]. Look at the compression ratios for completely random data:

In [84]:ls -l /tmp/test*.h5
-rw-r--r-- 1 faltet users 3675874 /tmp/test-bzip2-shuffle.h5
-rw-r--r-- 1 faltet users 3680615 /tmp/test-zlib-shuffle.h5
-rw-r--r-- 1 faltet users 3777749 /tmp/test-lzo-shuffle.h5
-rw-r--r-- 1 faltet users 8025024 /tmp/test-nocompr.h5

LZO is especially interesting if you want fast access to your data (it is very fast at decompressing):

In [82]:tlzo=timeit.Timer("carr[300:310:3,900:910:2]","import tables;f=tables.openFile('/tmp/test-lzo-shuffle.h5');carr=f.root.carr")
In [83]:tlzo.repeat(3,100)
Out[83]:[0.12332820892333984, 0.11892890930175781, 0.12009191513061523]

So retrieving compressed data using LZO is just 45% slower than not using compression at all. You can see more exhaustive benchmarks and discussion in [5].

[1] http://www.pytables.org
[2] http://www.bzip2.org
[3] http://www.oberhumer.com/opensource/lzo
[4] http://hdf.ncsa.uiuc.edu/HDF5/doc_resource/H5Shuffle_Perf.pdf
[5] http://pytables.sourceforge.net/html-doc/usersguide6.html#section6.3

Uh, sorry for the blurb, but benchmarking is a lot of fun.

--
0,0<   Francesc Altet     http://www.carabos.com/
V V    Cárabos Coop. V.   Enjoy Data
 "-"