
I've been experimenting with pulling quantitative data out of a MySQL table into NumPy arrays via Andy Dustman's excellent MySQLdb module and then calculating various statistics from the data using Gary Strangman's excellent stats.py functions, which when operating on NumPy arrays are lightning-fast.

The problem is the speed with which data can be extracted from a column of a MySQL (or any other SQL database) query result set and stuffed into a NumPy array. This inevitably involves forming a Python list and then assigning that to a NumPy array. This is both slow and memory-hungry, especially with large datasets (I have been playing with a few million rows).

I was wondering if it would be feasible to initially add a method to the _mysql class in the MySQLdb module which iterated through a result set using a C routine (rather than a Python routine) and stuffed the data directly into a NumPy array (or arrays - one for each column in the result set) in one fell swoop (or even iterating row-by-row but in C)? I suspect that such a facility would be much faster than having to move the data into NumPy via a standard Python list (or actually via tuples within a list, which is the way the Python DB-API returns results).

If this direct MySQL-to-NumPy interface worked well, it might be desirable to add it to the Python DB-API specification for optional implementation in the other database modules which conform to the API. There are probably other extensions which would make the DB-API more useful for statistical applications, which tend to be set (column)-oriented rather than row-oriented - will post to the list as these occur to me.

Cheers, Tim Churches

PS I will be away for the next week so apologies in advance for not replying immediately to any follow-ups to this posting. TC
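To make the bottleneck concrete, here is a minimal sketch of the current route (table, column and connection names are made up for illustration) - the whole result set has to pass through a Python list of row tuples before it ever reaches a NumPy (Numeric) array:

import MySQLdb
from Numeric import array

conn = MySQLdb.connect(db="epidata")    # hypothetical connection details
cur = conn.cursor()
cur.execute("SELECT age, weight, height FROM survey")

# The DB-API hands back a sequence of row tuples; turning millions of these
# into a NumPy array means holding both the tuples and the array in memory.
rows = cur.fetchall()
data = array(rows, 'd')                 # the slow, memory-hungry step

ages = data[:, 0]                       # column vector for the stats.py functions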

On Sun, 9 Apr 2000, Tim Churches wrote:
It might be possible to do something like this. I would prefer that such a feature work as a separate module (i.e. I don't think it is generally applicable to MySQLdb/_mysql). Or perhaps it could be a compile-time option for _mysql (-DUSE_NUMPY).

The object that you want to mess with is the _mysql result object. It contains an attribute MYSQL_RES *result, which is a pointer to the actual MySQL structure.

I don't remember if NumPy arrays are extensible or not, i.e. can rows be appended? That would affect the design. If they are not extensible, then you are probably limited to using mysql_store_result() (result set stored on the client side), as opposed to mysql_use_result() (result set stored on the server side). mysql_store_result() is probably preferable in this case anyway, so extensibility doesn't matter, as we can find the size of the result set in advance with mysql_num_rows(). Then we know the full size of the array. However, with very large result sets, it may be necessary to use mysql_use_result(), in which case the array will need to be extended, possibly row-by-row.

I could do this, but I need to know how to create and assign values to a NumPy array from within C. Or perhaps an initial (empty) array with the correct number of columns can be passed. I am pretty sure NumPy arrays look like sequences (of sequences), so assignment should not be a big problem. Easiest solution (for me, and puts least bloat in _mysql) would be for the user to pass in a NumPy array.

Question: Would it be adequate to put all columns returned into the array? If label columns need to be returned, this could pose a problem. They may have to be returned as a separate query. Or else non-numeric columns would be excluded and returned in a list of tuples (this would be harder).

I suspect the existing cursor.executemany() is capable of INSERTing and UPDATEing NumPy arrays.

-- andy dustman | programmer/analyst | comstar.net, inc. telephone: 770.485.6025 / 706.549.7689 | icq: 32922760 | pgp: 0xc72f3f1d "Therefore, sweet knights, if you may doubt your strength or courage, come no further, for death awaits you all, with nasty, big, pointy teeth!"

Andy Dustman wrote:
The latter sounds good. I agree that most users of MySQLdb would not need it, so they shouldn't be burdened with it.
No they can't. I suspect that is the price to be paid for the efficient storage offered by NumPy arrays.
Yes, but the problem with mysql_store_result() is the large amount of memory required to store the result set. Couldn't the user be responsible for predetermining the size of the array via a query such as "select count(*) from sometable where...." and then pass this value as a parameter to the executeNumPy() method? In MySQL at least such count(*) queries are resolved very quickly, so such an approach wouldn't take twice the time. Then mysql_use_result() could be used to populate the initialised NumPy array with data row by row, so that there is only ever one complete copy of the data in memory, and that copy is in the NumPy array.
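A rough sketch of how that might look from Python, assuming a hypothetical executeNumPy() method and made-up table and column names (nothing like this exists in MySQLdb today):

import MySQLdb
from Numeric import zeros

conn = MySQLdb.connect(db="epidata")    # hypothetical connection details
cur = conn.cursor()

# Cheap preliminary query to size the array before any data is fetched.
cur.execute("SELECT COUNT(*) FROM sometable WHERE age > 40")
nrows = int(cur.fetchone()[0])

result = zeros((nrows, 3), 'd')         # one column per numeric column below

# Hypothetical method discussed in this thread: stream the rows straight
# into the pre-allocated array via mysql_use_result(), so only one full
# copy of the data ever exists, and it lives in the NumPy array.
cur.executeNumPy("SELECT age, weight, height FROM sometable WHERE age > 40",
                 result)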
I'll look at the NumPy docs re this. Can any of the NumPy developers give some clues re this?
Yes, more thought needed here - my initial thought was one NumPy array per column, particularly since NumPy arrays must be homogeneous wrt data type. Each NumPy array could be named the same as the column from which it is derived. Cheers, Tim C

On Fri, 14 Apr 2000, Tim Churches wrote:
After some more thought on this subject, and some poking around at NumPy, I came to the following conclusions: Since NumPy arrays are fixed-size, but otherwise sequences (in the multi-dimensional case, sequences of sequences), the best approach would be for the user to pass in a pre-sized array (i.e. from zeros(), and btw, the docstring for zeros is way wrong), and _mysql would simply access it through the Sequence object protocol, and update as many values as it could: if you passed a 100-row array, it would fill 100 rows or as many as were in the result set, whichever is less.

Since this requires no special knowledge of NumPy, it could be a standard addition (no conditional compilation required). This method (tentatively _mysql.fetch_rows_into_array(array)) would return the array argument as the result. IndexError would likely be raised if the array was too narrow (too many columns in result set). Probably this would not be a MySQLdb.Cursor method, but perhaps I can have a separate module with a cursor subclass which returns NumPy arrays.
Okay, I think I know what you mean here. You want to return each column as a (vertical) vector, whereas I am thinking along the lines of returning the result set as a matrix. Is that correct? Since it appears you can efficiently slice out column vectors as a[:,n], is my idea acceptable? i.e. something along the lines of the sketch below.
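A minimal sketch of that usage, assuming the tentative fetch_rows_into_array() method described above (the low-level _mysql connection calls are real, but the table, column names and row count are illustrative, and the new method does not yet exist):

import _mysql
from Numeric import zeros

conn = _mysql.connect(db="epidata")     # hypothetical connection details
conn.query("SELECT age, weight, height FROM survey")
result = conn.store_result()

nrows, ncols = 1000, 3                  # known in advance, e.g. from COUNT(*)
a = zeros((nrows, ncols), 'd')

# Tentative method discussed above (not implemented): fills as many rows of
# `a` as the result set supplies and returns the same array object.
a = result.fetch_rows_into_array(a)

# Column vectors can then be sliced straight out of the matrix.
weight = a[:, 1]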
-- andy dustman | programmer/analyst | comstar.net, inc. telephone: 770.485.6025 / 706.549.7689 | icq: 32922760 | pgp: 0xc72f3f1d "Therefore, sweet knights, if you may doubt your strength or courage, come no further, for death awaits you all, with nasty, big, pointy teeth!"

Andy Dustman wrote: [...snip...]
Yes, exactly.
The only problem is that NumPy arrays must be homogeneous wrt type, which means that, say, a categorical column containing just a few distinct values stored as an integer would have to be upcast to a double in the NumPy matrix if it was part of a query which also returned a float.

Would it be possible to extend your idea of passing in an array to the query? Perhaps the user could pass in a list of pre-existing, pre-sized sequence objects (which might be rank-1 NumPy arrays of various appropriate data types, or Python tuples) which correspond to the columns which are to be returned by the SQL query. It would be up to the user to determine the correct type for each NumPy array and to size the arrays or tuples correctly. The reason for needing tuples as well as NumPy arrays is that, as you mention, NumPy arrays only support numbers.

The intention would be for all of this to be wrapped in a class which may issue a number of small queries to determine the number of rows to be returned and the data types of the columns, so the user is shielded from having to work out these details. The only bit that has to be written in C is the function which takes the sequence of sequences (NumPy arrays or Python tuples) in which to store the query results column-wise, and stuffs the value for each column for each row of the result set into the appropriate passed-in sequence object.

I would be more than happy to assist with the Python code, testing and documentation, but my C skills aren't up to helping with the guts of it. In other words, making this part of the low-level _mysql interface would be sufficient.

Cheers, Tim C
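By way of illustration of the column-wise idea above, a rough sketch with a hypothetical helper name and made-up column names (the C-level routine does not exist; a plain Python list stands in for the label column here, since a pre-sized list is easier to fill than a tuple):

import _mysql
from Numeric import zeros

nrows = 1000                            # assume this came from a prior COUNT(*)

# One pre-sized buffer per column, typed to suit that column: the integer
# code stays an integer, the measurement is a double, and the label column
# falls back to a plain Python list, since NumPy arrays hold only numbers.
sex_code = zeros(nrows, 'l')
weight   = zeros(nrows, 'd')
postcode = [None] * nrows

conn = _mysql.connect(db="epidata")     # hypothetical connection details
conn.query("SELECT sex, weight, postcode FROM survey")
result = conn.use_result()

# Hypothetical C-level routine sketched above (not part of _mysql): walk the
# result set once, row by row, stuffing each field into the corresponding
# passed-in sequence, column by column.
result.fetch_columns_into_sequences([sex_code, weight, postcode])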
