Re: [Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

20 Jul 2007

On Friday 20 July 2007 04:42, Vincent Nijs wrote:
...
I am interested in using sqlite (or pytables) to store data for scientific
research. I wrote the attached test program to save and load a simulated
11x500,000 recarray. Average save and load times are given below (timeit
with 20 repetitions). The save time for sqlite is not really fair because I
have to delete the data table each time before I create the new one. It is
still pretty slow in comparison. Loading the recarray from sqlite is
significantly slower than pytables or cPickle. I am hoping there may be
more efficient ways to save and load recarrays from/to sqlite than what I
am now doing. Note that I infer the variable names and types from the data
rather than specifying them manually.
I'd luv to hear from people using sqlite, pytables, and cPickle about their
experiences.

saving recarray with cPickle:     1.448568 sec/pass
saving recarray with pytable:     3.437228 sec/pass
saving recarray with sqlite:    193.286204 sec/pass
loading recarray using cPickle:   0.471365 sec/pass
loading recarray with pytable:    0.692838 sec/pass
loading recarray with sqlite:    15.977018 sec/pass

For a fairer comparison, and for large amounts of data, you should tell
PyTables the number of rows you expect to end up feeding into the table
(see [1]) so that it can choose the best chunksize for I/O purposes.
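
As a rough illustration of that hint (this is not the attached benchmark
script), a sketch along these lines passes the array length as expectedrows
when the table is created. It assumes a recent PyTables, where the calls are
spelled open_file and create_table; in PyTables 2.0 they were openFile and
createTable. The helper names save_table and load_table and the node name
"data" are made up for the example.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables


def save_table(path, data):
    """Write a structured array (or recarray) to an HDF5 table.

    The number of rows is known up front, so it is passed as
    expectedrows to let PyTables pick a suitable chunksize.
    """
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table(
            "/", "data",                 # hypothetical node name
            description=data.dtype,      # column names and types from the array
            expectedrows=len(data),      # the hint discussed above
        )
        table.append(data)


def load_table(path):
    """Read the whole table back as a NumPy structured array."""
    with tables.open_file(path, mode="r") as h5:
        return h5.root.data.read()


if __name__ == "__main__":
    # Small stand-in for the simulated recarray; the column names are invented.
    demo = np.zeros(1000, dtype=[("id", "i8"), ("x", "f8")])
    demo["id"] = np.arange(len(demo))
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    save_table(demo_path, demo)
    print(load_table(demo_path)[:3])
```

For the 500,000-row recarray discussed here, len(data) is well above the
default of 10000 expected rows.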

I've redone the benchmarks (the new script is attached) with 
this 'optimization' on and here are my numbers:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  2.0
HDF5 version:      1.6.5
NumPy version:     1.0.3
Zlib version:      1.2.3
LZO version:       2.01 (Jun 27 2005)
Python version:    2.5 (r25:51908, Nov  3 2006, 12:01:01)
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
Platform:          linux2-x86_64
Byte-ordering:     little
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Test saving recarray using cPickle: 0.197113 sec/pass
Test saving recarray with pytables: 0.234442 sec/pass
Test saving recarray with pytables (with zlib): 1.973649 sec/pass
Test saving recarray with pytables (with lzo): 0.925558 sec/pass

Test loading recarray using cPickle: 0.151379 sec/pass
Test loading recarray with pytables: 0.165399 sec/pass
Test loading recarray with pytables (with zlib): 0.553251 sec/pass
Test loading recarray with pytables (with lzo): 0.264417 sec/pass

As you can see, once PyTables knows the total number of rows, the
differences between raw cPickle and PyTables are much smaller than in your
original run.  In fact, an automatic optimization could easily be added to
PyTables: when the user passes a recarray, its length would be compared with
the default number of expected rows (currently 10000), and if the former is
larger, the length of the recarray would be used instead.
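
The rule just described fits in a couple of lines. This is only a sketch of
the idea, not code from PyTables; the function name choose_expected_rows is
hypothetical, and 10000 is the default quoted in this message.

```python
DEFAULT_EXPECTED_ROWS = 10000  # default mentioned in this message


def choose_expected_rows(nrows, default=DEFAULT_EXPECTED_ROWS):
    """Return the row-count hint to use when a whole array is passed in.

    If the array is longer than the default, use its length;
    otherwise keep the default.
    """
    return max(nrows, default)


if __name__ == "__main__":
    print(choose_expected_rows(500000))  # array longer than the default
    print(choose_expected_rows(100))     # array shorter than the default
```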

I have also added the times when using compression, in case you are
interested in using it.  Here are the final file sizes:

$ ls -sh data
total 132M
24M data-lzo.h5  43M data-None.h5  43M data.pickle  25M data-zlib.h5

Of course, this uses completely random data; with real data the
compression ratios are expected to be higher than this.
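
To get a feeling for that effect without PyTables, here is a toy comparison
using only Python's standard zlib module. The "ten-level" column is an
invented stand-in for real data with few distinct values; how well actual
measurements compress depends entirely on the data, so treat this as an
illustration rather than a prediction.

```python
import array
import random
import zlib


def compressed_fraction(values):
    """Return compressed size / raw size for floats stored as 8-byte doubles."""
    raw = array.array("d", values).tobytes()
    return len(zlib.compress(raw)) / len(raw)


rng = random.Random(0)  # fixed seed so runs are repeatable
n = 50000

# Uniform random doubles, similar in spirit to the simulated benchmark data.
uniform_noise = [rng.random() for _ in range(n)]
# Hypothetical "real" column: only ten distinct readings, repeated many times.
few_levels = [float(rng.randrange(10)) for _ in range(n)]

noise_fraction = compressed_fraction(uniform_noise)
levels_fraction = compressed_fraction(few_levels)

print("random doubles: %.2f of raw size" % noise_fraction)
print("ten-level data: %.2f of raw size" % levels_fraction)
```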

[1] http://www.pytables.org/docs/manual/ch05.html#expectedRowsOptim

Cheers,

--
...
0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"