[Numpy-discussion] Numpy Enhancement Proposal: group_by functionality

Eelco Hoogendoorn hoogendoorn.eelco at gmail.com
Sun Jan 26 06:20:04 EST 2014


Hi all,

Please critique my draft exploring the possibilities of adding group_by
support to numpy:
http://pastebin.com/c5WLWPbp

In nearly every project I work on, I require group_by functionality of some
sort. Other libraries, such as pandas, provide this kind of functionality,
but I will try to make the case here that numpy ought to have a solid core
of group_by functionality. Primarily, one may argue that the concept of
grouping values by a key is far more general than a pandas dataframe. In
particular, one often needs a simple one-line transient association between
some keys and values, and wrangling the problem into the more permanent and
specialized data structure of a dataframe is simply not called for.

As a simple compact example:

import numpy as np
# group_by is defined in the draft module linked above

key1 = list('abaabb')
key2 = np.random.randint(0, 2, (6, 2))
values = np.random.rand(6, 3)
print(group_by((key1, key2)).median(values))

Points of note: we can group by arbitrary combinations of keys, and
subarrays can also act as keys. group_by offers a rich set of efficient
per-group reductions, as well as various ways to split your values per
group.
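
To make the idea of a per-group reduction concrete, here is a minimal sketch
using only existing numpy calls. It is not the implementation from the draft:
the helper name grouped_median is hypothetical, and it assumes a single
one-dimensional key rather than the compound keys the draft supports.

```python
import numpy as np

def grouped_median(keys, values):
    """Median of values per unique key (hypothetical helper, not the draft's API)."""
    keys = np.asarray(keys)
    values = np.asarray(values)
    # unique keys, plus for each row the index of the group it belongs to
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    # stable sort so that rows of the same group become contiguous
    order = np.argsort(inverse, kind='stable')
    # group sizes give the split points in the sorted order
    counts = np.bincount(inverse, minlength=len(unique_keys))
    splits = np.cumsum(counts)[:-1]
    groups = np.split(values[order], splits)
    # reduce each group along its first axis
    medians = np.array([np.median(g, axis=0) for g in groups])
    return unique_keys, medians

keys = np.array(list('abaabb'))
values = np.arange(18, dtype=float).reshape(6, 3)
unique_keys, medians = grouped_median(keys, values)
print(unique_keys)
print(medians)
```

The sort-then-split step is the part that a shared group_by core would make
reusable across different reductions.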

Also, the code here has a lot of overlap with np.unique and related
arraysetops. Functions like np.unique are easily reimplemented using the
groundwork laid out here, and may also be extended to benefit from the
generalizations made, allowing a wider variety of objects to have their
unique values taken. Note the axis keyword in the example below: what is
unique here are the images found along the first axis, not the individual
elements of shuffled.

# create a stack of images
images = np.random.rand(4, 64, 64)
# shuffle the images; this is a giant mess now; how to find all the original ones?
shuffled = images[np.random.randint(0, 4, 200)]
# there you go
print(unique(shuffled, axis=0))
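
For comparison, later numpy releases (1.13 and up) accept an axis keyword on
np.unique itself, so the same idea can be tried without the draft module. The
sketch below assumes such a release; the seeded RandomState is only there to
make the run repeatable and is not part of the original example.

```python
import numpy as np

# seeded generator so the sketch is repeatable (an assumption, not in the original)
rng = np.random.RandomState(0)

# create a stack of images
images = rng.rand(4, 64, 64)
# draw 200 images at random, with repetition
shuffled = images[rng.randint(0, 4, 200)]
# unique subarrays along the first axis, i.e. the distinct images
recovered = np.unique(shuffled, axis=0)
print(recovered.shape)
```

The recovered images come back in sorted order, not in the order of the
original stack.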

Some more examples and unit tests can be found at the end of the module.

I'd love to hear your feedback on this. Specifically:

   - Do you agree numpy would benefit from group_by functionality?
   - Do you have suggestions for further generalizations/extensions?
   - Any commentary on design decisions / implementation?

Regards,
Eelco Hoogendoorn