[AstroPy] astropy.table.Table groups: aggregate over & combine multiple columns?
Andrew Hearin
andrew.hearin at yale.edu
Wed Jan 20 16:11:08 EST 2016
Hi Evert,
I wrote up a gist of the Numpy solution to the problem that I made passing
mention of in my first reply. I have not benchmarked this against Tom's
solution, but the solution in the following gist is based on running
np.unique on an Astropy Table sorted on the grouping key. It's pretty fast,
and it seems to require a bit less code.
https://gist.github.com/aphearin/1a125a7ee6cd370740ef
I use this pattern all the time in my work; once you see how it works it's
pretty straightforward to just copy-and-paste and adapt a couple of lines
for your particular problem. When I can, I use the built-in Table
aggregation features; when that doesn't work, I do this.
The example shows how to compute the stellar mass-weighted average star
formation-rate on a group-by-group basis for a large fake galaxy table
generated at the beginning of the gist. The trick to make it fast is to
loop over the memory buffer of the data in the Table; otherwise looping
over the table elements directly is orders of magnitude slower.
Cheers,
Andrew
On Wed, Jan 20, 2016 at 2:34 PM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:
> There is now a working implementation of this if anyone wants to give it
> whirl or suggest improvements:
>
> https://github.com/astropy/astropy/pull/4513
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_astropy_astropy_pull_4513&d=AwMFaQ&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=dCXpQy8L_TlVzBcoYmH89spQdsv-85GQVn8oW79SyRE&s=gTrCkrDGkG2BY44Q2t3os-I6YwrbCCBqLSiGCORdTw4&e=>
>
> - Tom
>
>
> On Wed, Jan 20, 2016 at 12:40 PM, Aldcroft, Thomas <
> aldcroft at head.cfa.harvard.edu> wrote:
>
>> Excellent question indeed.
>>
>> The first quick comment I have is to always be aware that directly using
>> the functions np.sum and np.mean in aggregation will be orders of magnitude
>> faster than calling the `average()` function that was defined in the
>> original post. That is because in those special cases the numpy `reduceat`
>> method is called and everything gets done in the C layer. Thus Andrew's
>> suggestion for a workaround in this case is the right way to go if the data
>> tables are large.
>>
>> About the more generalized problem of getting access to the other table
>> columns within the aggregation function, that is unfortunately not possible
>> in the current release code. I have an idea for doing this, which is now
>> an astropy issue (https://github.com/astropy/astropy/issues/4513
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_astropy_astropy_issues_4513&d=AwMFaQ&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=dCXpQy8L_TlVzBcoYmH89spQdsv-85GQVn8oW79SyRE&s=E5U0Dfaq5N_BspH6AcSrzBAV36mmoM3MGsGonCPL5i0&e=>
>> ).
>>
>> As for what to do right now with astropy 1.1, the following illustrates
>> how to do generalized aggregation in the way that is needed for this
>> example. It will be relatively slow and possibly memory intensive, but if
>> the tables are not huge that won't be a problem:
>>
>> from __future__ import division, print_function
>> from astropy import table
>> from astropy.table import Table
>> from collections import OrderedDict
>>
>> t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>> [1, 2, 3, 4, 5, 6],
>> [2, 2, 1, 2, 1, 1]],
>> names=('name', 'value', 'weight'))
>>
>> grouped = t.group_by('name')
>>
>>
>> def transform_table(tbl):
>> """
>> Generalized function that takes table ``tbl`` as input
>> and returns a new Table ``out``. Note that ``out`` does not
>> necessarily need to have the same types or columns.
>>
>> The example is just the identity transform. Be aware that
>> in-place operations will affect the input table.
>> """
>> out = tbl
>> return out
>>
>> out_tables = []
>> for group in grouped.groups:
>> out_tables.append(transform_table(group))
>> result = table.vstack(out_tables)
>> print('transform_table')
>> print(result)
>> print()
>>
>>
>> def average_weighted(tbl, name):
>> col = tbl[name]
>> if name == 'weight':
>> value = col.sum()
>> else:
>> weight = tbl['weight']
>> value = (col * weight).sum() / weight.sum()
>>
>> return value
>>
>>
>> def transform_table_to_row(tbl, func):
>> """
>> Generalized function that takes table ``tbl`` as input
>> and returns a new table row as an OrderedDict. It applies
>> function ``func`` to each column.
>>
>> The example computes the weighted average of each field (where
>> possible) assuming the weights are in column ``weight``.
>> """
>> out = OrderedDict()
>> for name in t.colnames:
>> try:
>> value = func(tbl, name)
>> except:
>> # If something went wrong just ignore (could not perform
>> # operation on this column).
>> pass
>> else:
>> out[name] = value
>> return out
>>
>>
>> out_rows = []
>> for group in grouped.groups:
>> out_rows.append(transform_table_to_row(group, average_weighted))
>> result = Table(rows=out_rows)
>>
>> print('transform_table_to_row')
>> print(result)
>>
>> Code also at https://gist.github.com/taldcroft/12249ad7eeacbec12f44
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_taldcroft_12249ad7eeacbec12f44&d=AwMFaQ&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=dCXpQy8L_TlVzBcoYmH89spQdsv-85GQVn8oW79SyRE&s=qiTfvwXOmx2gG9WY5MtobhpSjQAYqknid05DX4qGDyY&e=>
>> .
>>
>> Cheers,
>> Tom
>>
>> On Wed, Jan 20, 2016 at 7:47 AM, Andrew Hearin <andrew.hearin at yale.edu>
>> wrote:
>>
>>> Hi Evert,
>>>
>>> Great question, I'm also really interested to hear the answer to this. I
>>> always use the built-in table aggregation functions when possible, but
>>> sometimes end up writing my own Numpy calculation for more complicated
>>> examples (using np.unique and/or np.searchsorted).
>>>
>>> For computing a group-wise weighted average, there is a way you can
>>> recast your problem that allows you to use the existing astropy built-in:
>>> just create a new column that is the product of your second and third
>>> columns, ' and then use aggregate(average) in the normal way on this new
>>> column.
>>>
>>> So I *think* that gives an answer to the specific example you gave, but
>>> it dodges the real question, which I am also interested to hear the experts
>>> weigh in on.
>>>
>>> Andrew
>>>
>>> On Tue, Jan 19, 2016 at 11:52 PM, Evert Rol <evert.rol at gmail.com> wrote:
>>>
>>>> Is there a way in an astropy table to run the TableGroups aggregate
>>>> function on multiple columns at once?
>>>>
>>>> In this specific case, I'd like to group by names in one column, and
>>>> then average the second column weighted by values in the third column.
>>>> An example would be:
>>>>
>>>> from astropy.table import Table
>>>>
>>>> def average(col):
>>>> # Manipulate multiple columns at once?
>>>> return col.mean()
>>>>
>>>> t = Table([['a', 'a', 'a', 'b', 'b', 'c'],
>>>> [1, 2, 3, 4, 5, 6],
>>>> [2, 2, 1, 2, 1, 1]],
>>>> names=('name', 'value', 'weight'))
>>>> group = t.group_by('name')
>>>> result = group.groups.aggregate(average)
>>>> print(result)
>>>>
>>>> which gives
>>>>
>>>> name value weight
>>>> ---- ----- -------------
>>>> a 2.0 1.66666666667
>>>> b 4.5 1.5
>>>> c 6.0 1.0
>>>>
>>>> which is not what I want.
>>>>
>>>>
>>>> In Pandas, this can be done with apply() on a groupby object, since
>>>> that passes the relevant subsection the dataframe as input to the function.
>>>> So I can write:
>>>>
>>>> def average_pd(df):
>>>> weight = df['weight']
>>>> total = weight.sum()
>>>> df['value'] *= weight / total
>>>> df['value'] = df['value'].sum()
>>>> df['weight'] = total # for info; not necessary
>>>> return df.iloc[0] # ignore other rows: they are the same anyway
>>>>
>>>> df = t.to_pandas()
>>>> result = df.groupby('name')[['value', 'weight']].apply(average_pd)
>>>> print(result)
>>>>
>>>> which gives:
>>>>
>>>> value weight
>>>> name
>>>> a 1.800000 5
>>>> b 4.333333 3
>>>> c 6.000000 1
>>>>
>>>> and 'value' consists of weighted averages.
>>>>
>>>> (code also on
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_evertrol_12955a5d98edf055a2f4&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=ix-QzeHis8ltaMFyVo3QvHpnQYri_s75MpTGsufcbqM&e=
>>>> )
>>>>
>>>>
>>>> Perhaps I overlooked some documentation, but I can't find if this can
>>>> be done in astropy.table. Or do I just need to approach this differently?
>>>> Alternatively, should I convert & stick to Pandas for this type of
>>>> functionality?
>>>>
>>>>
>>>> Evert
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> AstroPy mailing list
>>>> AstroPy at scipy.org
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=-BKjmG3hTRkdOfmFIKI5e3myB8cKiFHeJbTAhi3Zg5U&s=XZ616g8wR7LBzFglTQ8J2F-bDe6rE-HuXIrePKntv6w&e=
>>>>
>>>
>>>
>>> _______________________________________________
>>> AstroPy mailing list
>>> AstroPy at scipy.org
>>> https://mail.scipy.org/mailman/listinfo/astropy
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwMFaQ&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=dCXpQy8L_TlVzBcoYmH89spQdsv-85GQVn8oW79SyRE&s=YhXoP-ihVwi4j73PmjzKCQJUJPGvzKR82ag-Kj9YXfk&e=>
>>>
>>>
>>
>
> _______________________________________________
> AstroPy mailing list
> AstroPy at scipy.org
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.scipy.org_mailman_listinfo_astropy&d=AwICAg&c=-dg2m7zWuuDZ0MUcV7Sdqw&r=AHkQ8HPUDwzl0x62ybAnwN_OEebPRGDtcjUPBcnLYw4&m=dCXpQy8L_TlVzBcoYmH89spQdsv-85GQVn8oW79SyRE&s=YhXoP-ihVwi4j73PmjzKCQJUJPGvzKR82ag-Kj9YXfk&e=
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/astropy/attachments/20160120/d3edb251/attachment.html>
More information about the AstroPy
mailing list