[AstroPy] astropy.table.Table groups: aggregate over & combine multiple columns?

Wed Jan 20 23:22:10 EST 2016

Hi, thanks for the great responses! In particular with a gist and a PR. Seems like there's certainly interest in this.
I realise the example is overly simplistic and can be solved (better) differently, but was indeed just meant as an example to the underlying issue.

I hadn't thought about iterating over the groups (Andrew's gist); good to keep in mind.

I had a look at Thomas's PR, but it feels slightly off somehow. 
I guess I'm rather partial to the way Pandas handles this, where the (row-sliced) dataframe gets passed to the user function. It's then up to the user to return a new and correct (single-row) dataframe, but easily allows fancy tricks with e.g. (boolean) indexing with yet another column, or even using the value of the key column(s) that have been aggregated over.

Since I already had half a mind to make a PR (if there were interest and I hadn't overlooked the obvious), I've now implemented a PR[1] which somewhat mimicks the Pandas one (in a more simplistic manner), for thoughts and comparison with Thomas's PR.
Since the previous sentence steers this topic in the direction of the astropy development mailing list, I'll leave the discussion on an implementation to Github.

It seems that in short, there is no neat way to do this directly in astropy, but it might be coming soon.

Thanks again,

  Evert

[1] https://github.com/astropy/astropy/pull/4516

> Hi Evert,
> 
> I wrote up a gist of the Numpy solution to the problem that I made passing mention of in my first reply. I have not benchmarked this against Tom's solution, but the solution in the following gist is based on running np.unique on an Astropy Table sorted on the grouping key. It's pretty fast, and it seems to require a bit less code. 
> 
> https://gist.github.com/aphearin/1a125a7ee6cd370740ef
> 
> I use this pattern all the time in my work; once you see how it works it's pretty straightforward to just copy-and-paste and adapt a couple of lines for your particular problem. When I can, I use the built-in Table aggregation features; when that doesn't work, I do this.  
> 
> The example shows how to compute the stellar mass-weighted average star formation-rate on a group-by-group basis for a large fake galaxy table generated at the beginning of the gist. The trick to make it fast is to loop over the memory buffer of the data in the Table; otherwise looping over the table elements directly is orders of magnitude slower. 
> 
> Cheers,
> Andrew
> 
> On Wed, Jan 20, 2016 at 2:34 PM, Aldcroft, Thomas <aldcroft at head.cfa.harvard.edu> wrote:
> There is now a working implementation of this if anyone wants to give it whirl or suggest improvements:
> 
> https://github.com/astropy/astropy/pull/4513
> 
> - Tom
> 
> 
> On Wed, Jan 20, 2016 at 12:40 PM, Aldcroft, Thomas <aldcroft at head.cfa.harvard.edu> wrote:
> Excellent question indeed.  
> 
> The first quick comment I have is to always be aware that directly using the functions np.sum and np.mean in aggregation will be orders of magnitude faster than calling the `average()` function that was defined in the original post.  That is because in those special cases the numpy `reduceat` method is called and everything gets done in the C layer.  Thus Andrew's suggestion for a workaround in this case is the right way to go if the data tables are large.
> 
> About the more generalized problem of getting access to the other table columns within the aggregation function, that is unfortunately not possible in the current release code.  I have an idea for doing this, which is now an astropy issue (https://github.com/astropy/astropy/issues/4513).
> 
> As for what to do right now with astropy 1.1, the following illustrates how to do generalized aggregation in the way that is needed for this example.  It will be relatively slow and possibly memory intensive, but if the tables are not huge that won't be a problem:
> 
> from __future__ import division, print_function
> from astropy import table
> from astropy.table import Table
> from collections import OrderedDict
> 
> t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>            [1, 2, 3, 4, 5, 6],
>            [2, 2, 1, 2, 1, 1]],
>           names=('name', 'value', 'weight'))
> 
> grouped = t.group_by('name')
> 
> 
> def transform_table(tbl):
>     """
>     Generalized function that takes table ``tbl`` as input
>     and returns a new Table ``out``.  Note that ``out`` does not
>     necessarily need to have the same types or columns.
> 
>     The example is just the identity transform.  Be aware that
>     in-place operations will affect the input table.
>     """
>     out = tbl
>     return out
> 
> out_tables = []
> for group in grouped.groups:
>     out_tables.append(transform_table(group))
> result = table.vstack(out_tables)
> print('transform_table')
> print(result)
> print()
> 
> 
> def average_weighted(tbl, name):
>     col = tbl[name]
>     if name == 'weight':
>         value = col.sum()
>     else:
>         weight = tbl['weight']
>         value = (col * weight).sum() / weight.sum()
> 
>     return value
> 
> 
> def transform_table_to_row(tbl, func):
>     """
>     Generalized function that takes table ``tbl`` as input
>     and returns a new table row as an OrderedDict.  It applies
>     function ``func`` to each column.
> 
>     The example computes the weighted average of each field (where
>     possible) assuming the weights are in column ``weight``.
>     """
>     out = OrderedDict()
>     for name in t.colnames:
>         try:
>             value = func(tbl, name)
>         except:
>             # If something went wrong just ignore (could not perform
>             # operation on this column).
>             pass
>         else:
>             out[name] = value
>     return out
> 
> 
> out_rows = []
> for group in grouped.groups:
>     out_rows.append(transform_table_to_row(group, average_weighted))
> result = Table(rows=out_rows)
> 
> print('transform_table_to_row')
> print(result)
> 
> Code also at https://gist.github.com/taldcroft/12249ad7eeacbec12f44.
> 
> Cheers,
> Tom
> 
> On Wed, Jan 20, 2016 at 7:47 AM, Andrew Hearin <andrew.hearin at yale.edu> wrote:
> Hi Evert,
> 
> Great question, I'm also really interested to hear the answer to this. I always use the built-in table aggregation functions when possible, but sometimes end up writing my own Numpy calculation for more complicated examples (using np.unique and/or np.searchsorted). 
> 
> For computing a group-wise weighted average, there is a way you can recast your problem that allows you to use the existing astropy built-in: just create a new column that is the product of your second and third columns, ' and then use aggregate(average)  in the normal way on this new column. 
> 
> So I *think* that gives an answer to the specific example you gave, but it dodges the real question, which I am also interested to hear the experts weigh in on. 
> 
> Andrew
> 
> On Tue, Jan 19, 2016 at 11:52 PM, Evert Rol <evert.rol at gmail.com> wrote:
> Is there a way in an astropy table to run the TableGroups aggregate function on multiple columns at once?
> 
> In this specific case, I'd like to group by names in one column, and then average the second column weighted by values in the third column.
> An example would be:
> 
>   from astropy.table import Table
> 
>   def average(col):
>       # Manipulate multiple columns at once?
>       return col.mean()
> 
>   t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>              [1, 2, 3, 4, 5, 6],
>              [2, 2, 1, 2, 1, 1]],
>             names=('name', 'value', 'weight'))
>   group = t.group_by('name')
>   result = group.groups.aggregate(average)
>   print(result)
> 
> which gives
> 
>   name value     weight
>   ---- ----- -------------
>      a   2.0 1.66666666667
>      b   4.5           1.5
>      c   6.0           1.0
> 
> which is not what I want.
> 
> 
> In Pandas, this can be done with apply() on a groupby object, since that passes the relevant subsection the dataframe as input to the function.
> So I can write:
> 
>   def average_pd(df):
>       weight = df['weight']
>       total = weight.sum()
>       df['value'] *= weight / total
>       df['value'] = df['value'].sum()
>       df['weight'] = total  # for info; not necessary
>       return df.iloc[0]  # ignore other rows: they are the same anyway
> 
>   df = t.to_pandas()
>   result = df.groupby('name')[['value', 'weight']].apply(average_pd)
>   print(result)
> 
> which gives:
> 
>          value  weight
> name
> a     1.800000       5
> b     4.333333       3
> c     6.000000       1
> 
> and 'value' consists of weighted averages.
> 
> (code also on https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_evertrol_12955a5d98edf055a2f4&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=ix-QzeHis8ltaMFyVo3QvHpnQYri_s75MpTGsufcbqM&e=  )
> 
> 
> Perhaps I overlooked some documentation, but I can't find if this can be done in astropy.table. Or do I just need to approach this differently?
> Alternatively, should I convert & stick to Pandas for this type of functionality?
> 
> 
>   Evert
> 
> 
> 
> 
> 
> 
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=XZ616g8wR7LBzFglTQ8J2F-bDe6rE-HuXIrePKntv6w&e=
> 
> 
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
> https://mail.scipy.org/mailman/listinfo/astropy
> 
> 
> 
> 
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=dCXpQy8L_TlVzBcoYmH89spQdsv-85GQVn8oW79SyRE&s=YhXoP-ihVwi4j73PmjzKCQJUJPGvzKR82ag-Kj9YXfk&e=
> 
> 
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
> https://mail.scipy.org/mailman/listinfo/astropy