[DB-SIG] Re: NumPy, Python DB-API and MySQL

Andy Dustman adustman@comstar.net
Mon, 10 Apr 2000 12:27:03 -0400 (EDT)


On Sun, 9 Apr 2000, Tim Churches wrote:

> I've been experimenting with pulling quantitative data out of a MySQL
> table into NumPy arrays via Andy Dustman's excellent MySQLdb module and
> then calculating various statistics from the data using Gary Strangman's
> excellent stats.py functions, which when operating on NumPy arrays are
> lightning-fast.
> 
> The problem is the speed with which data can be extracted from a column
> of a MySQL (or any other SQL database) query result set and stuffed into
> a NumPy array. This inevitably involves forming a Python list and then
> assigning that to a NumPy array. This is both slow and memory-hungry,
> especially with large datasets (I have been playing with a few million
> rows).
> 
> I was wondering if it would be feasible to initially add a method to the
> _mysql class in the MySQLdb module which iterated through a result set
> using a C routine (rather than a Python routine) and stuffed the data
> directly into a NumPy array (or arrays - one for each column in the
> result set) in one fell swoop (or even iterating row-by-row but in C)? I
> suspect that such a facility would be much faster than having to move
> the data into NumPy via a standard Python list (or actually via tuples
> within a list, which is the way the Python DB-API returns results).
> 
> If this direct MySQL-to-NumPy interface worked well, it might be
> desirable to add it to the Python DB-API specification for optional
> implementation in the other database modules which conform to the API.
> There are probably other extensions which would make the DB-API more
> useful for statistical applications, which tend to be set
> (column)-oriented rather than row-oriented - will post to the list as
> these occur to me.
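The list-based path Tim describes can be sketched in a few lines: fetchall()
hands back a list of row tuples, which then has to be transposed into
per-column sequences before anything array-oriented can happen (a hypothetical
modern sketch; plain tuples stand in for a real result set):

```python
# The slow path: DB-API fetchall() yields a list of row tuples,
# which must be transposed and copied, column by column, into arrays.

def rows_to_columns(rows):
    """Transpose a list of row tuples into per-column tuples."""
    return list(zip(*rows))

rows = [(1, 2.5), (2, 3.5), (3, 4.5)]   # as cursor.fetchall() returns them
cols = rows_to_columns(rows)
# cols[0] is the first column, cols[1] the second
```

Every value passes through a Python object on the way, which is exactly the
overhead a C routine writing straight into the array would avoid.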

It might be possible to do something like this. I would prefer that such a
feature work as a separate module (i.e. I don't think it is generally
applicable to MySQLdb/_mysql). Or perhaps it could be a compile-time
option for _mysql (-DUSE_NUMPY).

The object that you want to mess with is the _mysql result object. It
contains an attribute MYSQL_RES *result, which is a pointer to the actual
MySQL structure. I don't remember if NumPy arrays are extensible or not,
i.e. can rows be appended? That would affect the design. If they are not
extensible, then you are probably limited to using mysql_store_result()
(result set stored on the client side), as opposed to mysql_use_result()
(result set stored on the server side). mysql_store_result is probably
preferable in this case anyway, so extensibility doesn't matter, as we can
find the size of the result set in advance with mysql_num_rows(). Then we
know the full size of the array.
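Since mysql_store_result() makes mysql_num_rows() available up front, the C
routine could allocate the whole array once and fill it in place. Here is a
Python-level sketch of that pattern, using the stdlib array module as a
stand-in for a NumPy buffer (function and parameter names are hypothetical):

```python
from array import array

def fill_preallocated(fetch_row, nrows, ncols):
    """Preallocate a flat buffer of nrows*ncols doubles and fill it
    row by row, as a C routine could after mysql_num_rows()."""
    buf = array('d', bytes(8 * nrows * ncols))  # zero-filled doubles
    for i in range(nrows):
        row = fetch_row()
        for j in range(ncols):
            buf[i * ncols + j] = row[j]
    return buf

# Simulated result set of known size:
data = iter([(1.0, 2.0), (3.0, 4.0)])
buf = fill_preallocated(lambda: next(data), nrows=2, ncols=2)
```

The single up-front allocation is the whole point: no list of tuples, no
per-row reallocation.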

However, with very large result sets, it may be necessary to use
mysql_use_result(), in which case the array will need to be extended,
possibly row-by-row.
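With mysql_use_result() the row count is unknown until the set is exhausted,
so the array must grow as rows arrive. Extending strictly row-by-row would
mean a reallocation per row; doubling the capacity when full amortizes the
copies. A hypothetical sketch of that growth strategy:

```python
from array import array

def fill_streaming(rows, ncols):
    """Fill a flat buffer of doubles from a row stream of unknown
    length, doubling capacity when full (amortized O(1) append),
    as a C routine over mysql_use_result() might."""
    cap, n = 4, 0
    buf = array('d', bytes(8 * cap * ncols))
    for row in rows:
        if n == cap:                               # buffer full: double it
            cap *= 2
            bigger = array('d', bytes(8 * cap * ncols))
            bigger[: n * ncols] = buf[: n * ncols]
            buf = bigger
        for j in range(ncols):
            buf[n * ncols + j] = row[j]
        n += 1
    return buf[: n * ncols], n

flat, nrows = fill_streaming([(1.0, 2.0)] * 5, ncols=2)
```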

I could do this, but I need to know how to create and assign values to a
NumPy array from within C. Or perhaps an initial (empty) array with the
correct number of columns can be passed. I am pretty sure NumPy arrays
look like sequences (of sequences), so assignment should not be a big
problem. Easiest solution (for me, and puts least bloat in _mysql) would
be for the user to pass in a NumPy array.
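The pass-in-an-array approach keeps _mysql lean: because NumPy arrays act as
sequences of sequences, the filling code can simply assign into rows of
whatever the caller hands it. A hypothetical Python sketch, with a nested list
standing in for the caller's preallocated NumPy array:

```python
def fetch_into(result_rows, out):
    """Copy fetched rows into a caller-supplied 2-D sequence.
    'out' could be a preallocated NumPy array of the right shape;
    a nested list stands in for it here."""
    for i, row in enumerate(result_rows):
        out[i][:] = row   # slice-assign works on any mutable row sequence
    return out

out = [[0.0, 0.0] for _ in range(2)]
fetch_into([(1.0, 2.0), (3.0, 4.0)], out)
```

The caller decides the dtype and shape; the fetch routine never has to know
anything NumPy-specific.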

Question: Would it be adequate to put all columns returned into the array?
If label columns need to be returned, this could pose a problem. They may
have to be returned as a separate query. Or else non-numeric columns would
be excluded and returned in a list of tuples (this would be harder).
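The harder variant could work roughly like this: scan the column types once,
route numeric columns into the array, and hand back the label columns as a
separate list of tuples. A hypothetical sketch of the partitioning, with a
plain boolean mask standing in for the type information in the result's
description:

```python
def split_columns(rows, is_numeric):
    """Partition each row into numeric values (destined for the
    array) and label values (returned separately)."""
    num_idx = [j for j, flag in enumerate(is_numeric) if flag]
    lab_idx = [j for j, flag in enumerate(is_numeric) if not flag]
    numeric = [tuple(r[j] for j in num_idx) for r in rows]
    labels = [tuple(r[j] for j in lab_idx) for r in rows]
    return numeric, labels

rows = [("NSW", 42, 3.14), ("VIC", 7, 2.72)]
numeric, labels = split_columns(rows, is_numeric=(False, True, True))
```

Row order is preserved on both sides, so the i-th label tuple still matches
the i-th numeric row.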

I suspect the existing cursor.executemany() is capable of INSERTing and
UPDATEing NumPy arrays.
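That should already work, since executemany() only requires a sequence of
parameter sequences, and a 2-D NumPy array iterates as exactly that. A sketch
using stdlib sqlite3 as a stand-in for MySQLdb (both follow the DB-API,
though sqlite3 uses qmark rather than format paramstyle; nested lists stand
in for the array's rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL, y REAL)")

# A NumPy array's rows satisfy the sequence-of-sequences contract
# that executemany() expects; plain lists stand in for them here.
data = [[1.0, 2.0], [3.0, 4.0]]
conn.executemany("INSERT INTO t VALUES (?, ?)", data)

rows = conn.execute("SELECT x, y FROM t ORDER BY x").fetchall()
```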

-- 
andy dustman       |     programmer/analyst     |      comstar.net, inc.
telephone: 770.485.6025 / 706.549.7689 | icq: 32922760 | pgp: 0xc72f3f1d
"Therefore, sweet knights, if you may doubt your strength or courage, 
come no further, for death awaits you all, with nasty, big, pointy teeth!"