Numpy Enhancement Proposal: group_by functionality
Hi all,

Please critique my draft exploring the possibilities of adding group_by support to numpy: http://pastebin.com/c5WLWPbp

In nearly every project I work on, I require group_by functionality of some sort. There are other libraries that provide this kind of functionality, such as pandas, but I will try to make the case here that numpy ought to have a solid core of group_by functionality. Primarily, one may argue that the concept of grouping values by a key is far more general than a pandas dataframe. In particular, one often needs a simple one-line transient association between some keys and values, and wrangling the problem into the more permanent and specialized data structure that a dataframe is, is simply not called for.

As a simple compact example:

    key1 = list('abaabb')
    key2 = np.random.randint(0,2,(6,2))
    values = np.random.rand(6,3)
    print group_by((key1, key2)).median(values)

Points of note: we can group by arbitrary combinations of keys, and subarrays can also act as keys. group_by has a rich set of reduction functionality, which performs efficient per-group reductions, as well as various ways to split your values per group.

Also, the code here has a lot of overlap with np.unique and related arraysetops. Functions like np.unique are easily reimplemented using the groundwork laid out here, and may also be extended to benefit from the generalizations made, allowing for a wider variety of objects to have their unique values taken. Note the axis keyword here, meaning that what is unique here are the images found along the first axis, not the elements of shuffled.

    #create a stack of images
    images = np.random.rand(4,64,64)
    #shuffle the images; this is a giant mess now; how to find all the original ones?
    shuffled = images[np.random.randint(0,4,200)]
    #there you go
    print unique(shuffled, axis=0)

Some more examples and unit tests can be found at the end of the module.

I'd love to hear your feedback on this.
Specifically:

- Do you agree numpy would benefit from group_by functionality?
- Do you have suggestions for further generalizations/extensions?
- Any commentary on design decisions / implementation?

Regards,
Eelco Hoogendoorn
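
For readers who want to try the image example without the draft module: later NumPy releases give np.unique an axis keyword of its own. The sketch below uses that NumPy function, not the draft's unique, and is written for Python 3; the seed and the names picks and found are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the run is repeatable

# Create a stack of four images.
images = rng.random((4, 64, 64))

# Shuffle with repetition: 200 images drawn from the original four.
picks = rng.integers(0, 4, 200)
shuffled = images[picks]

# Unique entries along the first axis are whole images, not single pixels.
found = np.unique(shuffled, axis=0)
print(found.shape)
```

The images in found come back in sorted order, not in the order in which they were originally stacked.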
Hi Eelco

On Sun, 26 Jan 2014 12:20:04 +0100, Eelco Hoogendoorn wrote:
    key1 = list('abaabb')
    key2 = np.random.randint(0,2,(6,2))
    values = np.random.rand(6,3)
    print group_by((key1, key2)).median(values)

I agree that group_by functionality could be handy in numpy. In the above example, what would the output of ``group_by((key1, key2))`` be?

Stéfan
An object of type GroupBy.
So a call to group_by does not return any consumable output directly. If
you want for instance the unique keys, or groups if you will, you can call
GroupBy.unique. In this case, for a tuple of input keys, you'd get a tuple
of unique keys back. If you want to compute several reductions over the
same set of keys, you can hang on to the GroupBy object, and the
precomputations it encapsulates.
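
A minimal sketch of what such an object could look like, assuming single one-dimensional keys only. This is a hypothetical stand-in and not the code from the draft; the attribute names sorter and start are assumptions, while unique and count follow the names used in this thread.

```python
import numpy as np

class GroupBy:
    """Hypothetical stand-in for the draft's GroupBy (1-D keys only)."""

    def __init__(self, keys):
        keys = np.asarray(keys)
        # The expensive step: sort once and remember where each group starts.
        self.sorter = np.argsort(keys, kind="stable")
        sorted_keys = keys[self.sorter]
        is_start = np.ones(len(keys), dtype=bool)
        is_start[1:] = sorted_keys[1:] != sorted_keys[:-1]
        self.start = np.flatnonzero(is_start)
        self.unique = sorted_keys[self.start]
        self.count = np.diff(np.append(self.start, len(keys)))

    def sum(self, values):
        # Reorder the values like the keys, then sum each contiguous group.
        values = np.asarray(values)[self.sorter]
        return self.unique, np.add.reduceat(values, self.start, axis=0)

    def mean(self, values):
        unique, total = self.sum(values)
        # Reshape the counts so they broadcast against multi-column values.
        count = self.count.reshape((-1,) + (1,) * (total.ndim - 1))
        return unique, total / count

g = GroupBy(list("abaabb"))
values = np.arange(6.0)
keys_out, means = g.mean(values)
print(g.unique, g.count, means)
```

The constructor does all the key handling, so g can be kept around and asked for several reductions without sorting the keys again.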
To expand on that example: reduction operations also return the unique keys
which the reduced elements belong to:
(unique1, unique2), median = group_by((key1, key2)).median(values)
print unique1
print unique2
print median
yields something like
['a' 'a' 'b' 'b' 'a']
[[0 0]
 [0 1]
 [0 1]
 [1 0]
 [1 1]]
[[ 0.34041782 0.78579254 0.91494441]
 [ 0.59422888 0.67915262 0.04327812]
 [ 0.45045529 0.45049761 0.49633574]
 [ 0.71623235 0.95760152 0.85137696]
 [ 0.96299801 0.27639574 0.70519413]]
Note that the elements of unique1 and unique2 are not themselves unique,
but rather their elements zipped together are unique.
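
The "zipped together" uniqueness can be sketched with plain NumPy, assuming each key is first reduced to one integer code per row. The helper names _codes and group_median are hypothetical and this is not how the draft implements it; the fixed key2 array stands in for the random one so the result is repeatable.

```python
import numpy as np

def _codes(key):
    """Map a key (1-D, or 2-D with one row per item) to integer codes."""
    key = np.asarray(key)
    if key.ndim == 1:
        inverse = np.unique(key, return_inverse=True)[1]
    else:
        inverse = np.unique(key, axis=0, return_inverse=True)[1]
    return np.ravel(inverse)

def group_median(keys, values):
    """Per-group median for a tuple of keys; illustrative only."""
    values = np.asarray(values)
    # One column of codes per key; equal rows belong to the same group.
    codes = np.stack([_codes(k) for k in keys], axis=1)
    _, first, inverse = np.unique(
        codes, axis=0, return_index=True, return_inverse=True
    )
    inverse = np.ravel(inverse)
    # Report each group by the key values at its first occurrence.
    uniques = tuple(np.asarray(k)[first] for k in keys)
    medians = np.stack(
        [np.median(values[inverse == g], axis=0) for g in range(len(first))]
    )
    return uniques, medians

key1 = list("abaabb")
key2 = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [0, 1], [0, 1]])
values = np.arange(18.0).reshape(6, 3)
(unique1, unique2), median = group_median((key1, key2), values)
print(unique1)
print(unique2)
print(median)
```

Here unique1 repeats 'a' because ('a', [0 0]) and ('a', [1 1]) are different groups; only the pairs are unique.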
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
To follow up with an example as to why it is useful that a temporary object is created, consider the following (taken from the radial reduction example):

    g = group_by(np.round(radius, 5).flatten())
    pp.errorbar(
        g.unique,
        g.mean(sample.flatten())[1],
        g.std(sample.flatten())[1] / np.sqrt(g.count))

Creating the GroupBy object encapsulates the expense of 'indexing' the keys, which is the most expensive part of these operations. We would have to redo that four times here, if we didn't have access to the GroupBy object.
From looking at the numpy source, I get the impression that it is considered good practice not to overuse OOP. And I agree, but I think it is called for here.
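
The cost argument can be illustrated with plain NumPy calls, as a sketch rather than the draft's implementation. The radius and sample arrays below are made-up stand-ins for those of the radial reduction example, and the standard deviation is assumed to be the population one (ddof=0), which may differ from what the draft's std does.

```python
import numpy as np

# Hypothetical stand-ins for the `radius` and `sample` arrays.
y, x = np.mgrid[-8:9, -8:9]
radius = np.sqrt(x**2 + y**2)
rng = np.random.default_rng(0)
sample = np.cos(radius) + 0.1 * rng.standard_normal(radius.shape)

# Index the keys once: unique radii, a group label per pixel, group sizes.
unique, inverse, count = np.unique(
    np.round(radius, 5).ravel(), return_inverse=True, return_counts=True
)

# Every reduction below reuses `inverse` instead of sorting the keys again.
flat = sample.ravel()
mean = np.bincount(inverse, weights=flat) / count
var = np.bincount(inverse, weights=(flat - mean[inverse]) ** 2) / count
stderr = np.sqrt(var) / np.sqrt(count)
```

The arrays unique, mean and stderr are what the errorbar call would receive; only the np.unique line touches the keys.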
On 1/26/2014 12:02 PM, Stéfan van der Walt wrote:
what would the output of
``group_by((key1, key2))``
I'd expect something named "groupby" to behave as below.

Alan

    def groupby(seq, key):
        from collections import defaultdict
        groups = defaultdict(list)
        for item in seq:
            groups[key(item)].append(item)
        return groups

    print groupby(range(20), lambda x: x%2)
Alan:
The equivalent of that in my current draft would be group_by(keys, values),
which is shorthand for group_by(keys).group(values); an optional values
argument to the constructor of GroupBy is bound immediately, and the call
returns an iterable over the grouped values. But we often want to bind
different value objects, with different operations, to the same set of keys,
so it is convenient to be able to delay the binding of the values argument.
Also, the third argument to group_by is an optional reduction function.
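
A rough sketch of that call shape, with keys, values and an optional reduction passed in one go. The function name split_by is hypothetical, chosen so as not to suggest this is the draft's group_by, and only one-dimensional keys are handled.

```python
import numpy as np

def split_by(keys, values, reduction=None):
    """Split values into per-key groups, optionally reducing each group."""
    keys = np.asarray(keys)
    values = np.asarray(values)
    # A stable sort keeps the original order of the values inside a group.
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    # Positions where the sorted key changes mark the group boundaries.
    boundaries = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
    groups = np.split(values[order], boundaries)
    if reduction is None:
        return groups
    return [reduction(group) for group in groups]

keys = list("abaabb")
values = np.arange(6)
groups = split_by(keys, values)
sums = split_by(keys, values, np.sum)
print(groups, sums)
```

Unlike a GroupBy object, this redoes the sort on every call, which is the cost that delaying the binding of values avoids.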
My comment is just on the name. I'd expect something named `groupby` to behave essentially like Mathematica's `GatherBy` command. http://reference.wolfram.com/mathematica/ref/GatherBy.html

I think you are after something more like Matlab's grpstats: http://www.mathworks.com/help/stats/grpstats.html

Perhaps the implicit reference to SQL justifies the name...

Sorry if this seems off topic,
Alan Isaac
Not off topic at all; there are several matters of naming that I am not at
all settled on yet, and I don't think it is unimportant.

Indeed, those are closely related functions, and I wasn't aware of them
yet, so that's some welcome additional perspective. The Mathematica
function differs in that the keys are always a function of the values, as
in your example. My proposed interface does not have that
constraint, but that behavior is of course easily obtained by something
like group_by(mapping(values), values).

Indeed, grpstats also has a lot of overlap, though it does not have the same
generality as my proposal.

It's interesting to wonder where one gets one's ideas as to how to call what.
I've never worked with SQL much; I suppose I picked up this naming by
working with LINQ. I rather like group_by; it is more suitable to the
generality of the operations supported by the group_by object than
something like grpstats. The majority of my applications for grouping have
nothing whatsoever to do with statistics.
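
The GatherBy-style case, where the keys are computed from the values, might be sketched as below. The name gather_by is hypothetical, and the per-group boolean mask is chosen for brevity rather than speed.

```python
import numpy as np

def gather_by(values, mapping):
    """Group values by mapping(values); an illustrative helper."""
    values = np.asarray(values)
    keys = np.asarray(mapping(values))
    unique, inverse = np.unique(keys, return_inverse=True)
    # One boolean mask per group: simple, but linear work for every group.
    return {k: values[inverse == i] for i, k in enumerate(unique.tolist())}

groups = gather_by(np.arange(20), lambda x: x % 2)
print(groups)
```

This reproduces the even/odd split of the defaultdict example from earlier in the thread, with arrays instead of lists as the groups.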
participants (3)
- Alan G Isaac
- Eelco Hoogendoorn
- Stéfan van der Walt