[Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

Vincent Nijs v-nijs at kellogg.northwestern.edu
Sun Jul 22 11:21:18 EDT 2007


I asked a question about the load and save speed of recarray's using pickle
vs pysqlite on the pysqlite list and got the response linked below. Doesn't
look like sqlite can do much better than what I found.


I also passed on Francesc's idea to use numpy containers in relational
database wrappers such as pysqlite. This is apparently not possible since
"in a "relational database you don't know the type of the values in advance.
Some values might be NULL" and "and you might even have different types for
the same column"


I would assume the NULL's could be treated as missing values (?) Don't know
about the different types in one column however.


On 7/20/07 10:53 AM, "Francesc Altet" <faltet at carabos.com> wrote:

> Vincent,
> A Divendres 20 Juliol 2007 15:35, Vincent Nijs escrigué:
>> Still curious however ... does no one on this list use (and like) sqlite?
> First of all, while I'm not a heavy user of relational databases, I've used
> them as references for benchmarking purposes.  Hence, based on my own
> benchmarking experience, I'd say that, for writing, relational databases do
> take a lot of safety measures to ensure that all the data that is written to
> the disk is safe and that the data relationships don't get broken, and that
> takes times (a lot of time, in fact).   I'm not sure about whether some of
> these safety measures can be relaxed, but even though some relational
> databases would allow this, my feel (beware, I can be wrong) is that you
> won't be able to reach cPickle/PyTables speed (cPickle/PyTables are not
> observing security measures in that regard because they are not thought for
> these tasks).
> In this sense, the best writing  speed that I was able to achieve with
> Postgres (I don't know whether sqlite support this) is by simulating that
> your data comes from a file stream and using the "cursor.copy_from()" method.
> Using this approach I was able to accelerate a 10x (if I remember well) the
> injecting speed, but even with this, PyTables can be another 10x faster. You
> can see an exemple of usage in the Postgres backend [1] used for doing the
> benchmarks for comparing PyTables and Postgres speeds.
> Regarding reading speed, my diggins [2] seems to indicate that the bottleneck
> here is not related with safety, but with the need of the relational
> databases pythonic APIs of wrapping *every* element retrieved out of the
> database with a Python container (int, float, string...).  On the contrary,
> PyTables does take advantage of creating an empty recarray as the container
> to keep all the retrieved data, and that's very fast compared with the former
> approach.  To somewhat quantify this effect in function of the size of the
> dataset retrieved, you can see the figure 14 of [3] (as you can see, the
> larger the dataset retrieved, the larger the difference in terms of speed).
> Incidentally, and as it is said there, I'm hoping that NumPy containers
> should eventually be discovered by relational database wrappers makers, so
> these wrapping times would be removed completely, but I'm currently not aware
> of any package taking this approach.
> [1] http://www.pytables.org/trac/browser/trunk/bench/postgres_backend.py
> [2] http://thread.gmane.org/gmane.comp.python.numeric.general/9704
> [3] http://www.carabos.com/docs/OPSI-indexes.pdf
> Cheers,

More information about the NumPy-Discussion mailing list