[AstroPy] astropy.table.Table groups: aggregate over & combine multiple columns?

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Wed Jan 20 14:34:37 EST 2016


There is now a working implementation of this if anyone wants to give it
whirl or suggest improvements:

https://github.com/astropy/astropy/pull/4513

- Tom


On Wed, Jan 20, 2016 at 12:40 PM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:

> Excellent question indeed.
>
> The first quick comment I have is to always be aware that directly using
> the functions np.sum and np.mean in aggregation will be orders of magnitude
> faster than calling the `average()` function that was defined in the
> original post.  That is because in those special cases the numpy `reduceat`
> method is called and everything gets done in the C layer.  Thus Andrew's
> suggestion for a workaround in this case is the right way to go if the data
> tables are large.
>
> About the more generalized problem of getting access to the other table
> columns within the aggregation function, that is unfortunately not possible
> in the current release code.  I have an idea for doing this, which is now
> an astropy issue (https://github.com/astropy/astropy/issues/4513).
>
> As for what to do right now with astropy 1.1, the following illustrates
> how to do generalized aggregation in the way that is needed for this
> example.  It will be relatively slow and possibly memory intensive, but if
> the tables are not huge that won't be a problem:
>
> from __future__ import division, print_function
> from astropy import table
> from astropy.table import Table
> from collections import OrderedDict
>
> t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>            [1, 2, 3, 4, 5, 6],
>            [2, 2, 1, 2, 1, 1]],
>           names=('name', 'value', 'weight'))
>
> grouped = t.group_by('name')
>
>
> def transform_table(tbl):
>     """
>     Generalized function that takes table ``tbl`` as input
>     and returns a new Table ``out``.  Note that ``out`` does not
>     necessarily need to have the same types or columns.
>
>     The example is just the identity transform.  Be aware that
>     in-place operations will affect the input table.
>     """
>     out = tbl
>     return out
>
> out_tables = []
> for group in grouped.groups:
>     out_tables.append(transform_table(group))
> result = table.vstack(out_tables)
> print('transform_table')
> print(result)
> print()
>
>
> def average_weighted(tbl, name):
>     col = tbl[name]
>     if name == 'weight':
>         value = col.sum()
>     else:
>         weight = tbl['weight']
>         value = (col * weight).sum() / weight.sum()
>
>     return value
>
>
> def transform_table_to_row(tbl, func):
>     """
>     Generalized function that takes table ``tbl`` as input
>     and returns a new table row as an OrderedDict.  It applies
>     function ``func`` to each column.
>
>     The example computes the weighted average of each field (where
>     possible) assuming the weights are in column ``weight``.
>     """
>     out = OrderedDict()
>     for name in t.colnames:
>         try:
>             value = func(tbl, name)
>         except:
>             # If something went wrong just ignore (could not perform
>             # operation on this column).
>             pass
>         else:
>             out[name] = value
>     return out
>
>
> out_rows = []
> for group in grouped.groups:
>     out_rows.append(transform_table_to_row(group, average_weighted))
> result = Table(rows=out_rows)
>
> print('transform_table_to_row')
> print(result)
>
> Code also at https://gist.github.com/taldcroft/12249ad7eeacbec12f44.
>
> Cheers,
> Tom
>
> On Wed, Jan 20, 2016 at 7:47 AM, Andrew Hearin <andrew.hearin at yale.edu>
> wrote:
>
>> Hi Evert,
>>
>> Great question, I'm also really interested to hear the answer to this. I
>> always use the built-in table aggregation functions when possible, but
>> sometimes end up writing my own Numpy calculation for more complicated
>> examples (using np.unique and/or np.searchsorted).
>>
>> For computing a group-wise weighted average, there is a way you can
>> recast your problem that allows you to use the existing astropy built-in:
>> just create a new column that is the product of your second and third
>> columns, ' and then use aggregate(average)  in the normal way on this new
>> column.
>>
>> So I *think* that gives an answer to the specific example you gave, but
>> it dodges the real question, which I am also interested to hear the experts
>> weigh in on.
>>
>> Andrew
>>
>> On Tue, Jan 19, 2016 at 11:52 PM, Evert Rol <evert.rol at gmail.com> wrote:
>>
>>> Is there a way in an astropy table to run the TableGroups aggregate
>>> function on multiple columns at once?
>>>
>>> In this specific case, I'd like to group by names in one column, and
>>> then average the second column weighted by values in the third column.
>>> An example would be:
>>>
>>>   from astropy.table import Table
>>>
>>>   def average(col):
>>>       # Manipulate multiple columns at once?
>>>       return col.mean()
>>>
>>>   t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>>>              [1, 2, 3, 4, 5, 6],
>>>              [2, 2, 1, 2, 1, 1]],
>>>             names=('name', 'value', 'weight'))
>>>   group = t.group_by('name')
>>>   result = group.groups.aggregate(average)
>>>   print(result)
>>>
>>> which gives
>>>
>>>   name value     weight
>>>   ---- ----- -------------
>>>      a   2.0 1.66666666667
>>>      b   4.5           1.5
>>>      c   6.0           1.0
>>>
>>> which is not what I want.
>>>
>>>
>>> In Pandas, this can be done with apply() on a groupby object, since that
>>> passes the relevant subsection the dataframe as input to the function.
>>> So I can write:
>>>
>>>   def average_pd(df):
>>>       weight = df['weight']
>>>       total = weight.sum()
>>>       df['value'] *= weight / total
>>>       df['value'] = df['value'].sum()
>>>       df['weight'] = total  # for info; not necessary
>>>       return df.iloc[0]  # ignore other rows: they are the same anyway
>>>
>>>   df = t.to_pandas()
>>>   result = df.groupby('name')[['value', 'weight']].apply(average_pd)
>>>   print(result)
>>>
>>> which gives:
>>>
>>>          value  weight
>>> name
>>> a     1.800000       5
>>> b     4.333333       3
>>> c     6.000000       1
>>>
>>> and 'value' consists of weighted averages.
>>>
>>> (code also on
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_evertrol_12955a5d98edf055a2f4&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=ix-QzeHis8ltaMFyVo3QvHpnQYri_s75MpTGsufcbqM&e=
>>> )
>>>
>>>
>>> Perhaps I overlooked some documentation, but I can't find if this can be
>>> done in astropy.table. Or do I just need to approach this differently?
>>> Alternatively, should I convert & stick to Pandas for this type of
>>> functionality?
>>>
>>>
>>>   Evert
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> AstroPy mailing list
>>> AstroPy at scipy.org
>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=XZ616g8wR7LBzFglTQ8J2F-bDe6rE-HuXIrePKntv6w&e=
>>>
>>
>>
>> _______________________________________________
>> AstroPy mailing list
>> AstroPy at scipy.org
>> https://mail.scipy.org/mailman/listinfo/astropy
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/astropy/attachments/20160120/f2fce2cd/attachment.html>


More information about the AstroPy mailing list