Anyone written an SQL-like interface to numpy/PyTables?
Hi,
Has anyone written a parser for SQL-like queries against PyTables HDF
tables or numpy recarrays?
I'm asking because I have written code for grouping then summing rows of
source data, where the groups are defined by functions of the source
data, or looking up a related field in a separate lookup tables. I use
this for tracking the performance of customer segments by status, with
48 monthly files of product usage/customer status data on 5m customers.
Each of these is a 5m row HDF file, with several other 5m row HDF files
that are used to work out which segment a customer belongs to.
This grouping and summing is equivalent to something like the following
SQL code:
SELECT
grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*),
grp_fn4(table3.*),
count(table1.*),
sum(table1.field1), sum(table1.field2), ..., sum(table1.fieldK)
FROM table1 PARTITION(date)
LEFT JOIN table2 ON table1.field0=table2.field0
LEFT JOIN table3 ON table2.field0=table3.field0
WHERE min_date<=date<=max_date
GROUP BY grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*),
grp_fn4(table3.*)
I'm using numpy.bincount() function to do the grouping/summing,
numpy.searchsorted() for fast lookup tables implementing the grouping
functions grp_fn(), and some other C functions for a fast "zip" join of
related tables whose primary keys are in the same order as the monthly
date partitions.
The Python code that specifies the grouping/summing fields looks like this:
agg = HDFAggregator('table1.hdf')
agg.add_group_function('MONTH', ...
I've often thought it would be interesting if someone would build a custom
table adapter to use PyTables in SQLlite. Ie, essentially bolting a SQL
parser and query engine on top of PyTables. Unfortunately, I don't have
time to do this, though hopefully someone will at some point.
-Kevin
On 6/18/07, Stephen Simmons
Hi,
Has anyone written a parser for SQL-like queries against PyTables HDF tables or numpy recarrays?
I'm asking because I have written code for grouping then summing rows of source data, where the groups are defined by functions of the source data, or looking up a related field in a separate lookup tables. I use this for tracking the performance of customer segments by status, with 48 monthly files of product usage/customer status data on 5m customers. Each of these is a 5m row HDF file, with several other 5m row HDF files that are used to work out which segment a customer belongs to.
This grouping and summing is equivalent to something like the following SQL code: SELECT grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*), grp_fn4(table3.*), count(table1.*), sum(table1.field1), sum(table1.field2), ..., sum(table1.fieldK) FROM table1 PARTITION(date) LEFT JOIN table2 ON table1.field0=table2.field0 LEFT JOIN table3 ON table2.field0=table3.field0 WHERE min_date<=date<=max_date GROUP BY grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*), grp_fn4(table3.*)
I'm using numpy.bincount() function to do the grouping/summing, numpy.searchsorted() for fast lookup tables implementing the grouping functions grp_fn(), and some other C functions for a fast "zip" join of related tables whose primary keys are in the same order as the monthly date partitions.
The Python code that specifies the grouping/summing fields looks like this: agg = HDFAggregator('table1.hdf') agg.add_group_function('MONTH', ...
) agg.add_group_function('SEGMENT', ... ) agg.add_group_function('STATUS', ... ) agg.do_aggregation(groupby='MONTH SEGMENT STATUS', count='CUST_NO', sum='<list of fields to sum>) agg.add_calculated_field('PROFIT', 'VOLUME*(PRICE* (1-DISCOUNT)-COGS) - COSTOFSALES') agg.save('output.hdf') On my laptop, this zips over my data at a speed of 400k rows/sec, aggregating it into 230,000 groups (48 months x 120 customer segments/subsegments x 5 product groups x 8 statuses) with subtotals for 30 data fields in each group. This is essentially as fast as PyTables can read in the HDF files from disk; peak speeds with fewer groups (e.g. 48x5x1x4) are above 1Mrows/sec if the HDF files are already in the disk cache.
One option I am considering now is bolting an SQL-like parser on the front to provide a more natural interface for those unfortunate people who prefer SQL to Python. I don't want to write an SQL parser from scratch, so it would be great to know if there are any existing projects to put an SQL-like interface on numpy or PyTables (other than numexpr).
So has anyone looked at using an SQL-like syntax for querying numpy/PyTables data?
Cheers
Stephen _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
participants (2)
-
Kevin Jacobs <jacobs@bioinformed.com>
-
Stephen Simmons