I've been experimenting with pulling quantitative data out of a MySQL table into NumPy arrays via Andy Dustman's excellent MySQLdb module and then calculating various statistics from the data using Gary Strangman's excellent stats.py functions, which when operating on NumPy arrays are lightning-fast.
The problem is the speed with which data can be extracted from a column of a MySQL (or any other SQL database) query result set and stuffed into a NumPy array. This inevitably involves forming a Python list and then assigning that to a NumPy array. This is both slow and memory-hungry, especially with large datasets (I have been playing with a few million rows).
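For illustration, the list-based path looks roughly like this - a sketch using the modern numpy package in place of the Numeric module, with a hard-coded list of tuples standing in for a real cursor.fetchall() result:

```python
import numpy

# Simulated result set: the DB-API returns a list of row tuples.
rows = [(1, 2.5), (2, 3.5), (3, 4.5)]  # stand-in for cursor.fetchall()

# Every value is first boxed as a Python object inside the list of tuples,
# then copied again into the array -- two passes over the data, and two
# complete copies of it in memory at the same time.
first_column = numpy.array([row[0] for row in rows], dtype=float)
```

With a few million rows, the intermediate list of tuples dominates both the run time and the peak memory use.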
I was wondering if it would be feasible to initially add a method to the _mysql class in the MySQLdb module which iterated through a result set using a C routine (rather than a Python routine) and stuffed the data directly into a NumPy array (or arrays - one for each column in the result set) in one fell swoop (or even iterating row-by-row but in C)? I suspect that such a facility would be much faster than having to move the data into NumPy via a standard Python list (or actually via tuples within a list, which is the way the Python DB-API returns results).
If this direct MySQL-to-NumPy interface worked well, it might be desirable to add it to the Python DB-API specification for optional implementation in the other database modules which conform to the API. There are probably other extensions which would make the DB-API more useful for statistical applications, which tend to be set (column) oriented rather than row-oriented - I will post to the list as these occur to me.
Cheers,
Tim Churches
PS I will be away for the next week so apologies in advance for not replying immediately to any followups to this posting.
TC
On Sun, 9 Apr 2000, Tim Churches wrote:
> I've been experimenting with pulling quantitative data out of a MySQL table into NumPy arrays via Andy Dustman's excellent MySQLdb module and then calculating various statistics from the data using Gary Strangman's excellent stats.py functions, which when operating on NumPy arrays are lightning-fast.
> The problem is the speed with which data can be extracted from a column of a MySQL (or any other SQL database) query result set and stuffed into a NumPy array. This inevitably involves forming a Python list and then assigning that to a NumPy array. This is both slow and memory-hungry, especially with large datasets (I have been playing with a few million rows).
> I was wondering if it would be feasible to initially add a method to the _mysql class in the MySQLdb module which iterated through a result set using a C routine (rather than a Python routine) and stuffed the data directly into a NumPy array (or arrays - one for each column in the result set) in one fell swoop (or even iterating row-by-row but in C)? I suspect that such a facility would be much faster than having to move the data into NumPy via a standard Python list (or actually via tuples within a list, which is the way the Python DB-API returns results).
> If this direct MySQL-to-NumPy interface worked well, it might be desirable to add it to the Python DB-API specification for optional implementation in the other database modules which conform to the API. There are probably other extensions which would make the DB-API more useful for statistical applications, which tend to be set (column) oriented rather than row-oriented - I will post to the list as these occur to me.
It might be possible to do something like this. I would prefer that such a feature work as a separate module (i.e. I don't think it is generally applicable to MySQLdb/_mysql). Or perhaps it could be a compile-time option for _mysql (-DUSE_NUMPY).
The object that you want to mess with is the _mysql result object. It contains an attribute MYSQL_RES *result, which is a pointer to the actual MySQL structure. I don't remember if NumPy arrays are extensible or not, i.e. can rows be appended? That would affect the design. If they are not extensible, then you are probably limited to using mysql_store_result() (result set stored on the client side), as opposed to mysql_use_result() (result set stored on the server side). mysql_store_result is probably preferable in this case anyway, so extensibility doesn't matter, as we can find the size of the result set in advance with mysql_num_rows(). Then we know the full size of the array.
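Under those assumptions the client-side flow can be sketched in Python (modern numpy stands in for Numeric, and the hard-coded `n_rows`/`n_cols` values play the role of what mysql_num_rows() and mysql_num_fields() would report for a stored result set):

```python
import numpy

# With mysql_store_result(), the full size is known before any rows are read.
n_rows, n_cols = 4, 2
result = numpy.zeros((n_rows, n_cols), dtype=float)  # pre-sized, no appending needed

# A C routine would walk the stored result set and fill rows in place;
# a plain loop over simulated rows stands in for it here.
simulated_rows = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0)]
for i, row in enumerate(simulated_rows):
    result[i] = row
```

Because the array is allocated once at its final size, fixed-size (non-extensible) NumPy arrays are not a problem on this path.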
However, with very large result sets, it may be necessary to use mysql_use_result(), in which case the array will need to be extended, possibly row-by-row.
I could do this, but I need to know how to create and assign values to a NumPy array from within C. Or perhaps an initial (empty) array with the correct number of columns can be passed. I am pretty sure NumPy arrays look like sequences (of sequences), so assignment should not be a big problem. Easiest solution (for me, and puts least bloat in _mysql) would be for the user to pass in a NumPy array.
Question: Would it be adequate to put all columns returned into the array? If label columns need to be returned, this could pose a problem. They may have to be returned as a separate query. Or else non-numeric columns would be excluded and returned in a list of tuples (this would be harder).
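The second option might look like the following sketch; `split_columns` is a hypothetical helper (not part of MySQLdb), and modern numpy stands in for Numeric:

```python
import numpy

def split_columns(rows, numeric_cols):
    """Hypothetical helper: pack the numeric columns into one array and
    return the remaining (label) columns as a list of tuples."""
    numeric = numpy.array(
        [[row[i] for i in numeric_cols] for row in rows], dtype=float)
    label_cols = [i for i in range(len(rows[0])) if i not in numeric_cols]
    labels = [tuple(row[i] for i in label_cols) for row in rows]
    return numeric, labels

# A mixed result set: one label column, two numeric columns.
rows = [("NSW", 42, 1.5), ("VIC", 17, 2.5)]
numeric, labels = split_columns(rows, numeric_cols=[1, 2])
```

The caller would then match labels back to array rows by position.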
I suspect the existing cursor.executemany() is capable of INSERTing and UPDATEing NumPy arrays.
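The reasoning is that executemany() wants a sequence of row sequences, and a two-dimensional array already satisfies that protocol. A minimal sketch (no live connection; the INSERT statement shown in the comment is illustrative only):

```python
import numpy

a = numpy.array([[1.0, 2.0], [3.0, 4.0]])

# cursor.executemany("INSERT INTO t VALUES (%s, %s)", params) expects a
# sequence of parameter rows; the array's rows can be handed over directly,
# or converted to plain tuples first if the driver is picky about types.
params = [tuple(row) for row in a]
```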
Andy Dustman wrote:
> On Sun, 9 Apr 2000, Tim Churches wrote:
> > I've been experimenting with pulling quantitative data out of a MySQL table into NumPy arrays via Andy Dustman's excellent MySQLdb module and then calculating various statistics from the data using Gary Strangman's excellent stats.py functions, which when operating on NumPy arrays are lightning-fast. [...snip...]
> It might be possible to do something like this. I would prefer that such a feature work as a separate module (i.e. I don't think it is generally applicable to MySQLdb/_mysql). Or perhaps it could be a compile-time option for _mysql (-DUSE_NUMPY).
The latter sounds good. I agree that most users of MySQLdb would not need it, so they shouldn't be burdened with it.
> The object that you want to mess with is the _mysql result object. It contains an attribute MYSQL_RES *result, which is a pointer to the actual MySQL structure. I don't remember if NumPy arrays are extensible or not, i.e. can rows be appended?
No they can't. I suspect that is the price to be paid for the efficient storage offered by NumPy arrays.
> That would affect the design. If they are not extensible, then you are probably limited to using mysql_store_result() (result set stored on the client side), as opposed to mysql_use_result() (result set stored on the server side). mysql_store_result() is probably preferable in this case anyway, so extensibility doesn't matter, as we can find the size of the result set in advance with mysql_num_rows(). Then we know the full size of the array.
Yes, but the problem with mysql_store_result() is the large amount of memory required to store the result set. Couldn't the user be responsible for predetermining the size of the array via a query such as "select count(*) from sometable where ..." and then pass this value as a parameter to the executeNumPy() method? In MySQL at least such count(*) queries are resolved very quickly, so such an approach wouldn't take twice the time. Then mysql_use_result() could be used to populate the initialised NumPy array with data row by row, so there is only ever one complete copy of the data in memory, and that copy is in the NumPy array.
> However, with very large result sets, it may be necessary to use mysql_use_result(), in which case the array will need to be extended, possibly row-by-row.
> I could do this, but I need to know how to create and assign values to a NumPy array from within C. Or perhaps an initial (empty) array with the correct number of columns can be passed. I am pretty sure NumPy arrays look like sequences (of sequences), so assignment should not be a big problem. Easiest solution (for me, and puts least bloat in _mysql) would be for the user to pass in a NumPy array.
I'll look at the NumPy docs re this. Can any of the NumPy developers give some clues re this?
> Question: Would it be adequate to put all columns returned into the array? If label columns need to be returned, this could pose a problem. They may have to be returned as a separate query. Or else non-numeric columns would be excluded and returned in a list of tuples (this would be harder).
Yes, more thought needed here - my initial thought was one NumPy array per column, particularly since NumPy arrays must be homogeneous wrt data type. Each NumPy array could be named the same as the column from which it is derived.
Cheers,
Tim C
On Fri, 14 Apr 2000, Tim Churches wrote:
> Andy Dustman wrote:
> Yes, but the problem with mysql_store_result() is the large amount of memory required to store the result set. Couldn't the user be responsible for predetermining the size of the array via a query such as "select count(*) from sometable where ..." and then pass this value as a parameter to the executeNumPy() method? In MySQL at least such count(*) queries are resolved very quickly, so such an approach wouldn't take twice the time. Then mysql_use_result() could be used to populate the initialised NumPy array with data row by row, so there is only ever one complete copy of the data in memory, and that copy is in the NumPy array.
After some more thought on this subject, and some poking around at NumPy, I came to the following conclusions:
Since NumPy arrays are fixed-size, but otherwise sequences (in the multidimensional case, sequences of sequences), the best approach would be for the user to pass in a pre-sized array (i.e. from zeros(), and btw, the docstring for zeros is way wrong), and _mysql would simply access it through the Sequence object protocol and update as many values as it could: if you passed a 100-row array, it would fill 100 rows, or as many as were in the result set, whichever is less.
Since this requires no special knowledge of NumPy, it could be a standard addition (no conditional compilation required). This method (tentatively _mysql.fetch_rows_into_array(array)) would return the array argument as the result. IndexError would likely be raised if the array was too narrow (too many columns in the result set). Probably this would not be a MySQLdb.Cursor method, but perhaps I can have a separate module with a cursor subclass which returns NumPy arrays.
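The proposed semantics can be sketched in pure Python (modern numpy in place of Numeric; the real routine would be in C, and a list of tuples stands in for the live result set):

```python
import numpy

def fetch_rows_into_array(rows, array):
    """Sketch of the proposed semantics: fill a pre-sized array through
    plain sequence assignment, stopping when either the array is full
    or the result set is exhausted, and return the array."""
    n = min(len(rows), len(array))
    for i in range(n):
        array[i] = rows[i]  # raises if a row is wider than the array
    return array

result_set = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
a = fetch_rows_into_array(result_set, numpy.zeros((2, 2)))  # fills only 2 rows
```

Note that passing a 2-row array against a 3-row result set simply drops the excess rows, matching the "whichever is less" rule above.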
> > Question: Would it be adequate to put all columns returned into the array? If label columns need to be returned, this could pose a problem. They may have to be returned as a separate query. Or else non-numeric columns would be excluded and returned in a list of tuples (this would be harder).
> Yes, more thought needed here - my initial thought was one NumPy array per column, particularly since NumPy arrays must be homogeneous wrt data type. Each NumPy array could be named the same as the column from which it is derived.
Okay, I think I know what you mean here. You are wanting to return each column as a (vertical) vector, whereas I am thinking along the lines of returning the result set as a matrix. Is that correct? Since it appears you can efficiently slice out column vectors as a[:,n], is my idea acceptable? i.e.
>>> a = Numeric.multiarray.zeros((2,2), 'd')
>>> a[1,1] = 2
>>> a[0,1] = 1
>>> a[1,0] = 3
>>> a
array([[ 0.,  1.],
       [ 3.,  2.]])
>>> a[:,0]
array([ 0.,  3.])
>>> a[:,1]
array([ 1.,  2.])
Andy Dustman wrote: [...snip...]
> Okay, I think I know what you mean here. You are wanting to return each column as a (vertical) vector, whereas I am thinking along the lines of returning the result set as a matrix. Is that correct?
Yes, exactly.
> Since it appears you can efficiently slice out column vectors as a[:,n], is my idea acceptable? i.e.
> >>> a = Numeric.multiarray.zeros((2,2), 'd')
> >>> a[1,1] = 2
> >>> a[0,1] = 1
> >>> a[1,0] = 3
> >>> a
> array([[ 0.,  1.],
>        [ 3.,  2.]])
> >>> a[:,0]
> array([ 0.,  3.])
> >>> a[:,1]
> array([ 1.,  2.])
The only problem is that NumPy arrays must be homogeneous wrt type, which means that, say, a categorical column containing just a few distinct values stored as an integer would have to be upcast to a double in the NumPy matrix if it was part of a query which also returned a float.
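The upcast is easy to demonstrate (modern numpy shown; Numeric behaved the same way for int-plus-float mixes):

```python
import numpy

codes = [1, 2, 1]          # small categorical column stored as integers
weights = [0.5, 1.5, 2.5]  # a float column in the same query

# As one matrix, every element is promoted to the widest type (float64),
# so the integer codes cost 8 bytes each...
matrix = numpy.array(list(zip(codes, weights)))

# ...whereas one array per column keeps the compact integer representation.
code_col = numpy.array(codes)
weight_col = numpy.array(weights)
```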
Would it be possible to extend your idea of passing in an array to the query? Perhaps the user could pass in a list of pre-existing, pre-sized sequence objects (which might be rank-1 NumPy arrays of various appropriate data types, or Python tuples) which correspond to the columns which are to be returned by the SQL query. It would be up to the user to determine the correct type for each NumPy array and to size the arrays or tuples correctly. The reason for needing tuples as well as NumPy arrays is that, as you mention, NumPy arrays only support numbers. The intention would be for all of this to be wrapped in a class which may issue a number of small queries to determine the number of rows to be returned and the data types of the columns, so the user is shielded from having to work out these details. The only bit that has to be written in C is the function which takes the sequence of sequences (NumPy arrays or Python tuples) in which to store the query results column-wise, and stuffs the value of each column for each row of the result set into the appropriate passed-in sequence object. I would be more than happy to assist with the Python code, testing and documentation, but my C skills aren't up to helping with the guts of it. In other words, making this part of the low-level _mysql interface would be sufficient.
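The C function described above could behave like this Python sketch. `fill_columns` is a hypothetical name, and lists stand in for the pre-sized tuples, since tuple elements can only be assigned from C, not from Python:

```python
import numpy

def fill_columns(rows, columns):
    """Hypothetical helper: scatter each result row into per-column
    containers (arrays for numeric columns, lists for the rest),
    relying only on the sequence assignment protocol."""
    for i, row in enumerate(rows):
        for col, value in zip(columns, row):
            col[i] = value
    return columns

rows = [(1, "a", 0.5), (2, "b", 1.5)]   # simulated result set
ids = numpy.zeros(2, dtype=int)          # pre-sized numeric column
names = [None, None]                     # list stands in for a pre-sized tuple
scores = numpy.zeros(2)
fill_columns(rows, [ids, names, scores])
```

Each column keeps its own appropriate type, which is exactly what the one-array-per-column design is after.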
Cheers,
Tim C
--
andy dustman | programmer/analyst | comstar.net, inc.
telephone: 770.485.6025 / 706.549.7689 | icq: 32922760 | pgp: 0xc72f3f1d
"Therefore, sweet knights, if you may doubt your strength or courage, come no further, for death awaits you all, with nasty, big, pointy teeth!"
participants (2)
- Andy Dustman
- Tim Churches