[Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

Fri Jul 20 11:53:09 EDT 2007

Vincent,

A Divendres 20 Juliol 2007 15:35, Vincent Nijs escrigué:
> Still curious however ... does no one on this list use (and like) sqlite?

First of all, while I'm not a heavy user of relational databases, I've used 
them as references for benchmarking purposes.  Hence, based on my own 
benchmarking experience, I'd say that, for writing, relational databases do 
take a lot of safety measures to ensure that all the data that is written to 
the disk is safe and that the data relationships don't get broken, and that 
takes times (a lot of time, in fact).   I'm not sure about whether some of 
these safety measures can be relaxed, but even though some relational 
databases would allow this, my feel (beware, I can be wrong) is that you 
won't be able to reach cPickle/PyTables speed (cPickle/PyTables are not 
observing security measures in that regard because they are not thought for 
these tasks).

In this sense, the best writing  speed that I was able to achieve with 
Postgres (I don't know whether sqlite support this) is by simulating that 
your data comes from a file stream and using the "cursor.copy_from()" method.  
Using this approach I was able to accelerate a 10x (if I remember well) the 
injecting speed, but even with this, PyTables can be another 10x faster. You 
can see an exemple of usage in the Postgres backend [1] used for doing the 
benchmarks for comparing PyTables and Postgres speeds.

Regarding reading speed, my diggins [2] seems to indicate that the bottleneck 
here is not related with safety, but with the need of the relational 
databases pythonic APIs of wrapping *every* element retrieved out of the 
database with a Python container (int, float, string...).  On the contrary, 
PyTables does take advantage of creating an empty recarray as the container 
to keep all the retrieved data, and that's very fast compared with the former 
approach.  To somewhat quantify this effect in function of the size of the 
dataset retrieved, you can see the figure 14 of [3] (as you can see, the 
larger the dataset retrieved, the larger the difference in terms of speed).  
Incidentally, and as it is said there, I'm hoping that NumPy containers 
should eventually be discovered by relational database wrappers makers, so 
these wrapping times would be removed completely, but I'm currently not aware 
of any package taking this approach.

[1] http://www.pytables.org/trac/browser/trunk/bench/postgres_backend.py
[2] http://thread.gmane.org/gmane.comp.python.numeric.general/9704
[3] http://www.carabos.com/docs/OPSI-indexes.pdf

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"