mysql -> record array

Francesc Altet faltet at carabos.com
Thu Nov 16 16:14:21 EST 2006


Hi,

Well, let's see whether this shiny new list works :)

On Thu, 16 Nov 2006 at 11:05 -0700, Tim Hochberg wrote:
> One weakness of this benchmark is that it doesn't break out how much of
> the sqlite3 overhead is inherent to the sqlite3 engine, which I expect
> is somewhat more complicated internally than PyTables, and how much is
> due to all the extra layers we go through to get the data into an array
> (native [in database] -> Python objects -> native [in record array]).
> To try to get at least a little handle on this, I added this test:
> 
>     def querySQLite(conn):
>         c = conn.cursor()
>         c.execute('select * from demo where x = 0.0')
>         y = np.fromiter(c, dtype=dtype)
>         return y
> 
> This returns very little data (in the cases I ran, it actually returned
> no data). However, it still needs to loop over all the records and
> examine them. Here's what the timings looked like:
> 
>     setup SQLite took 9.71799993515 seconds
>     retrieve SQLite took 0.921999931335 seconds
>     query SQLite took 0.313000202179 seconds
> 
> I'm reluctant to conclude that 1/3 of the time is spent
> traversing the database and 2/3 of the time creating the data, solely
> because databases are big voodoo to me. Still, we can probably conclude
> that traversing the data itself is pretty expensive and we would be
> unlikely to approach PyTables speed even if we didn't have the extra
> overhead. On the other hand, there's a factor of three or so improvement
> that could be realized by reducing overhead.
> 
> Or maybe not. I think that the database has to return its data a row at
> a time, so there's intrinsically a lot of copying that's going to
> happen. So, I think it's unclear whether getting the data directly in
> native format would be significantly cheaper. I suppose that the way to
> definitively test it would be to rewrite one of these tests in C. Any
> volunteers?

I don't think that doing this in C is going to be necessary at all.
SQLite3 has a command-line shell that lets you run SQL operations without
Python in the middle.

I've included your new function; here are my results for Python (just
for comparison purposes):

setup SQLite took 35.9286799431 seconds
retrieve SQLite took 4.66451621056 seconds
query SQLite took 0.907495975494 seconds
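
(In case anyone wants to reproduce these numbers: the benchmark script
itself isn't in this thread, so here is a minimal sketch of the setup the
snippets assume. The two-float-column "demo" schema and the row count are
my guesses, not necessarily what the actual script uses.)

import sqlite3
import numpy as np

# Guessed schema, matching 'select * from demo where x = 0.0' above.
dtype = np.dtype([('x', np.float64), ('y', np.float64)])

def setupSQLite(path, n=1000*1000):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute('create table demo (x real, y real)')
    # tolist() yields plain Python floats, which sqlite3 accepts directly.
    c.executemany('insert into demo values (?, ?)',
                  np.random.random((n, 2)).tolist())
    conn.commit()
    return conn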

Now, the query from the console:

$ time sqlite3 /tmp/test.sql3 "select * from demo where x == 0.0"

real    0m1.226s
user    0m0.792s
sys     0m0.120s

If you sum the CPU user time and sys time, you get 0.9 seconds (roughly),
which is approximately the same as reported from Python. So, it doesn't
seem that the Python interface is taking a significant amount of time for
this specific case (as expected).

For a complete retrieval, we can do the same trick, but redirecting the
output to /dev/null to avoid writing it to another device:

$ time sqlite3 /tmp/test.sql3 "select * from demo" > /dev/null

real    0m3.690s
user    0m3.376s
sys     0m0.188s

In this case, the times sum to around 3.5 seconds, while the Python time
is 4.6 seconds. However, the console has the handicap that it has to
format every row as a string before "writing" it to /dev/null. So, I'd
say that the time for a retrieval from a C program should be
significantly less than 3.5 seconds (provided that there is an
appropriate in-memory container for keeping the data). So, the bottleneck
does effectively seem to be the conversion of the returned rows into a
recarray.
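
(To put some numbers behind that guess, one could time the traversal and
the conversion separately; a rough sketch, reusing the connection and
dtype from the setup sketch above:)

import time

def timeStages(conn):
    c = conn.cursor()
    # Stage 1: walk every row but discard it (pure traversal cost).
    t0 = time.time()
    c.execute('select * from demo')
    for row in c:
        pass
    t1 = time.time()
    # Stage 2: walk every row again, this time building the recarray.
    c.execute('select * from demo')
    y = np.fromiter(c, dtype=dtype)
    t2 = time.time()
    print('traversal only:       %.3f s' % (t1 - t0))
    print('traversal + fromiter: %.3f s' % (t2 - t1))
    return y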

Just out of curiosity, here is the equivalent query function in
PyTables:

def queryPT(fileh):
    t = fileh.root.table
    # t.where(t.cols.x == 0) does the selection inside PyTables, so
    # only matching rows ever reach the Python generator.
    y = np.fromiter(((r['x'], r['y']) for r in t.where(t.cols.x == 0)),
                    dtype=dtype)
    return y

and its times:

setup PyTables took 0.453132867813 seconds
retrieve PyTables took 0.356657981873 seconds
query PyTables took 0.253107070923 seconds

PyTables is more than 3x faster than SQLite for this specific query. I
guess this is because the selects are done completely through numarray
(in PyTables 2.x, selections will be done through numexpr), and it seems
difficult to beat such 'vectorized' selects. So, using a package like
numarray or NumPy in the core of a database has a lot of advantages,
especially when speed matters.
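
As a toy illustration of what 'vectorized' buys you, here is a per-row
Python scan next to the equivalent boolean-mask select on a plain NumPy
record array (just the idiom, on made-up random data; PyTables does
something more elaborate internally):

import numpy as np

n = 1000 * 1000
data = np.zeros(n, dtype=[('x', np.float64), ('y', np.float64)])
data['x'] = np.random.random(n)
data['y'] = np.random.random(n)

# Per-row scan: one interpreter-level iteration per record.
slow = np.fromiter(((r['x'], r['y']) for r in data if r['x'] < 0.001),
                   dtype=data.dtype)

# Vectorized select: the comparison and the gather both run in C.
fast = data[data['x'] < 0.001]

assert (slow == fast).all()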

Cheers,

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth



