On Friday 20 July 2007 04:42, Vincent Nijs wrote:
I am interested in using sqlite (or pytables) to store data for scientific research. I wrote the attached test program to save and load a simulated 11x500,000 recarray. Average save and load times are given below (timeit with 20 repetitions). The save time for sqlite is not really fair because I have to delete the data table each time before I create the new one. It is still pretty slow in comparison. Loading the recarray from sqlite is significantly slower than with pytables or cPickle. I am hoping there may be more efficient ways to save and load recarrays to/from sqlite than what I am doing now. Note that I infer the variable names and types from the data rather than specifying them manually.

I'd love to hear from people using sqlite, pytables, and cPickle about their experiences.
saving recarray with cPickle:     1.448568 sec/pass
saving recarray with pytable:     3.437228 sec/pass
saving recarray with sqlite:    193.286204 sec/pass

loading recarray using cPickle:   0.471365 sec/pass
loading recarray with pytable:    0.692838 sec/pass
loading recarray with sqlite:    15.977018 sec/pass
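One thing that might be worth trying on the sqlite side is to insert all rows with a single `executemany` call inside one transaction, while still inferring the column types from the data. The following is only a minimal sketch using the standard `sqlite3` module; the helper names `save_rows` and `load_rows` and the `SQL_TYPES` mapping are hypothetical, and the attached test program may well do something different.

```python
import sqlite3

# Hypothetical mapping from Python scalar types to SQLite column types.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}

def save_rows(conn, table, names, rows):
    """Create `table` with types inferred from the first row, then bulk-insert."""
    cols = ", ".join('"%s" %s' % (name, SQL_TYPES[type(value)])
                     for name, value in zip(names, rows[0]))
    conn.execute('DROP TABLE IF EXISTS "%s"' % table)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    marks = ", ".join("?" * len(names))
    with conn:  # commit once, after all rows have been inserted
        conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), rows)

def load_rows(conn, table):
    """Return all rows of `table` as a list of tuples."""
    return conn.execute('SELECT * FROM "%s"' % table).fetchall()
```

With NumPy, `recarray.tolist()` yields a list of tuples of plain Python scalars that could be passed as `rows`, and `recarray.dtype.names` could serve as `names`.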
For a fairer comparison, and for large amounts of data, you should inform PyTables about the expected number of rows (see [1]) that you will end up feeding into the tables, so that it can choose the best chunksize for I/O purposes. I've redone the benchmarks (the new script is attached) with this 'optimization' on, and here are my numbers:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  2.0
HDF5 version:      1.6.5
NumPy version:     1.0.3
Zlib version:      1.2.3
LZO version:       2.01 (Jun 27 2005)
Python version:    2.5 (r25:51908, Nov 3 2006, 12:01:01)
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
Platform:          linux2-x86_64
Byte-ordering:     little
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Test saving recarray using cPickle: 0.197113 sec/pass
Test saving recarray with pytables: 0.234442 sec/pass
Test saving recarray with pytables (with zlib): 1.973649 sec/pass
Test saving recarray with pytables (with lzo): 0.925558 sec/pass
Test loading recarray using cPickle: 0.151379 sec/pass
Test loading recarray with pytables: 0.165399 sec/pass
Test loading recarray with pytables (with zlib): 0.553251 sec/pass
Test loading recarray with pytables (with lzo): 0.264417 sec/pass

As you can see, the differences between raw cPickle and PyTables are much smaller than when PyTables is not told the total number of rows. In fact, an automatic optimization can easily be done in PyTables so that when the user passes a recarray, its length is compared with the default number of expected rows (currently 10000), and if the former is larger, the length of the recarray is used instead.

I have also added the times when using compression, just in case you are interested in using it.
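The automatic optimization described above could look roughly like the sketch below. It assumes the PyTables 3.x names `open_file` and `create_table` (the 2.0 release used in these benchmarks spelled them `openFile` and `createTable`); the helpers `choose_expected_rows` and `save_recarray` are hypothetical and not part of PyTables.

```python
DEFAULT_EXPECTED_ROWS = 10000  # default value quoted in the message above

def choose_expected_rows(n_rows, default=DEFAULT_EXPECTED_ROWS):
    """Return the row-count hint: the data length if it exceeds the default."""
    return max(n_rows, default)

def save_recarray(filename, recarray, name="data"):
    """Save `recarray` as a table, passing its length as the expectedrows hint."""
    # Imported here so the helper above can be used without PyTables installed.
    import tables
    with tables.open_file(filename, mode="w") as h5:
        h5.create_table(h5.root, name, obj=recarray,
                        expectedrows=choose_expected_rows(len(recarray)))
```

For the 500,000-row recarray in this benchmark, the hint passed to PyTables would be 500000 rather than the default of 10000.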
Here are the final file sizes:

$ ls -sh data
total 132M
24M data-lzo.h5
43M data-None.h5
43M data.pickle
25M data-zlib.h5

Of course, this is using completely random data; with real data the compression ratios are expected to be higher than this.

[1] http://www.pytables.org/docs/manual/ch05.html#expectedRowsOptim

Cheers,

--
0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"