[Numpy-discussion] Anyone written an SQL-like interface to numpy/PyTables?

Stephen Simmons mail at stevesimmons.com
Mon Jun 18 10:41:47 EDT 2007


Hi,

Has anyone written a parser for SQL-like queries against PyTables HDF 
tables or numpy recarrays?

I'm asking because I have written code for grouping then summing rows of 
source data, where the groups are defined by functions of the source 
data, or looking up a related field in a separate lookup tables. I use 
this for tracking the performance of customer segments by status, with 
48 monthly files of product usage/customer status data on 5m customers. 
Each of these is a 5m row HDF file, with several other 5m row HDF files 
that are used to work out which segment a customer belongs to.

This grouping and summing is equivalent to something like the following 
SQL code:
    SELECT
        grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*), 
grp_fn4(table3.*),
        count(table1.*),
        sum(table1.field1), sum(table1.field2), ..., sum(table1.fieldK)
    FROM table1 PARTITION(date)
    LEFT JOIN table2 ON table1.field0=table2.field0
    LEFT JOIN table3 ON table2.field0=table3.field0
    WHERE min_date<=date<=max_date
    GROUP BY grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*), 
grp_fn4(table3.*)

I'm using numpy.bincount() function to do the grouping/summing, 
numpy.searchsorted() for fast lookup tables implementing the grouping 
functions grp_fn(), and some other C functions for a fast "zip" join of 
related tables whose primary keys are in the same order as the monthly 
date partitions.

The Python code that specifies the grouping/summing fields looks like this:
   agg = HDFAggregator('table1.hdf')
   agg.add_group_function('MONTH', ...<min_date, max_date> )
   agg.add_group_function('SEGMENT', ... <lookup on table 2>  )
   agg.add_group_function('STATUS', ... <lookup on table 3> )
   agg.do_aggregation(groupby='MONTH SEGMENT STATUS', count='CUST_NO', 
sum='<list of fields to sum>)
   agg.add_calculated_field('PROFIT', 'VOLUME*(PRICE* (1-DISCOUNT)-COGS) 
- COSTOFSALES')
   agg.save('output.hdf')

On my laptop, this zips over my data at a speed of 400k rows/sec, 
aggregating it into 230,000 groups (48 months x 120 customer 
segments/subsegments x 5 product groups x 8 statuses) with subtotals for 
30 data fields in each group. This is essentially as fast as PyTables 
can read in the HDF files from disk; peak speeds with fewer groups (e.g. 
48x5x1x4) are above 1Mrows/sec if the HDF files are already in the disk 
cache.

One option I am considering now is bolting an SQL-like parser on the 
front to provide a more natural interface for those unfortunate people 
who prefer SQL to Python. I don't want to write an SQL parser from 
scratch, so it would be great to know if there are any existing projects 
to put an SQL-like interface on numpy or PyTables (other than numexpr).

So has anyone looked at using an SQL-like syntax for querying 
numpy/PyTables data?

Cheers

Stephen



More information about the NumPy-Discussion mailing list