
Hello,

This is my first e-mail; I will try to keep the idea simple. Similar to a masked array, it would be interesting to use a label array to guide operations. Ex.:

>>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing. You could think of it as a collection of masks, one for each label. I don't know a way to make something like this efficiently without a loop. Just wondering...

Sérgio.
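The arrays behind this example didn't survive the digest; below is a minimal sketch of the idea in plain numpy, where the data and label values are assumptions chosen so the per-label sums come out to [9, 12, 15] as quoted above:

import numpy as np

data = np.arange(9).reshape(3, 3)    # [[0 1 2] [3 4 5] [6 7 8]], assumed
label = np.tile([0, 1, 2], (3, 1))   # a label array of the same shape, assumed

# one mask per label, each reduced separately -- the "collection of masks" view
sums = np.array([data[label == u].sum() for u in np.unique(label)])
print(sums)   # [ 9 12 15]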

Seems like you are talking about xarray: https://github.com/pydata/xarray

Cheers!
Ben Root

On Fri, Feb 12, 2016 at 9:40 AM, Sérgio <filaboia@gmail.com> wrote:

Benjamin Root writes:
Seems like you are talking about xarray: https://github.com/pydata/xarray
Oh, I wasn't aware of xarray, but there's also this:
https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basi...
https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dime...

Cheers,
Lluis
Cheers! Ben Root
On Fri, Feb 12, 2016 at 9:40 AM, Sérgio <filaboia@gmail.com> wrote:
Hello,
This is my first e-mail; I will try to keep the idea simple.
Similar to a masked array, it would be interesting to use a label array to guide operations. Ex.:
>>> sum(x)
array([9, 12, 15])
The operations would create a new axis for label indexing.
You could think of it as a collection of masks, one for each label.
I don't know a way to make something like this efficiently without a loop. Just wondering...
Sérgio.

Just for posterity -- any future readers of this thread who need to do pandas-like operations on record arrays should look at matplotlib's mlab submodule. I've been in situations (::cough:: Esri production ::cough::) where I've had one hand tied behind my back and been unable to install pandas. mlab was a big help there.

https://goo.gl/M7Mi8B

-paul

On Mon, Feb 15, 2016 at 1:28 PM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:

I also want to add a historical note here: 'groupby' has been discussed a couple of times before. Travis Oliphant even wrote an NEP for it, and Wes McKinney lightly hinted at adding it to numpy.

http://thread.gmane.org/gmane.comp.python.numeric.general/37480/focus=37480
http://thread.gmane.org/gmane.comp.python.numeric.general/38272/focus=38299
http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html

Travis's idea for a ufunc method 'reduceby' is more along the lines of what I was originally thinking. Just musing about it, it might cover a few small cases pandas groupby might not: it could work with arbitrary ufuncs, and over particular axes of multidimensional data, e.g. to sum over pixels from NxNx3 image data. But maybe pandas can cover the multidimensional case through additional index columns or with Panel.

Cheers,
Allan

On 02/15/2016 05:31 PM, Paul Hobson wrote:
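For the multidimensional-axis case, here is a rough sketch of what is already possible with np.add.reduceat on sorted labels (this is not Travis's proposed 'reduceby'; the image and per-row labels are made up for illustration):

import numpy as np

np.random.seed(0)
img = np.random.randint(0, 256, size=(4, 4, 3))   # NxNx3 image data, assumed
labels = np.array([0, 1, 0, 1])                   # one label per row, assumed

order = labels.argsort()
slabels = labels[order]
# start index of each label's run in the sorted order
starts = np.flatnonzero(np.r_[True, slabels[1:] != slabels[:-1]])
# per-label sums over axis 0: one slice per unique label
label_sums = np.add.reduceat(img[order], starts, axis=0)
print(label_sums.shape)   # (2, 4, 3)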

On Fri, Feb 19, 2016 at 12:08 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
xarray is now covering that area. There are also the recfunctions in numpy.lib that never got much attention or expansion. There were plans to cover more of the matplotlib versions in numpy, but I don't know and haven't checked what happened to that.

Josef

matplotlib would be more than happy if numpy could take those functions off our hands! They don't get nearly the right visibility in matplotlib because no one expects them to be in a plotting library, and they don't have any useful unit tests. None of us wrote them, so we are very hesitant to update them because of that.

Cheers!
Ben Root

On Fri, Feb 19, 2016 at 1:39 PM, <josef.pktd@gmail.com> wrote:

I've had a pretty similar idea for a new indexing function 'split_classes' which would help in your case. It essentially does

def split_classes(c, v):
    return [v[c == u] for u in unique(c)]

Your example could be coded as

>>> [sum(c) for c in split_classes(label, data)]
[9, 12, 15]

I feel I've come across the need for such a function often enough that it might be generally useful to people as part of numpy. The implementation of split_classes above has pretty poor performance because it creates many temporary boolean arrays, so my plan for a PR was to have a speedy version of it that makes a single pass through v. (I often wanted to use this function on large datasets.) If anyone has any comments on the idea (good idea? bad idea?) I'd love to hear them. I have some further notes and examples here: https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Sorry to reply to myself here, but looking at it with fresh eyes, maybe the performance of the naive version isn't too bad. Here's a comparison of the naive vs. a better implementation:

from numpy import unique, where, split

def split_classes_naive(c, v):
    # one boolean mask (and one temporary array) per unique label
    return [v[c == u] for u in unique(c)]

def split_classes(c, v):
    # sort the labels once, then slice out each label's run
    perm = c.argsort()
    csrt = c[perm]
    div = where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in split(perm, div)]
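A quick harness for comparing the two, as a sketch (the sizes and the setup are illustrative, and timings will vary):

import timeit
import numpy as np

np.random.seed(1234)
c = np.random.randint(0, 10000, size=100000)
v = np.arange(100000)

for f in (split_classes_naive, split_classes):
    # average seconds per call over 10 runs
    print(f.__name__, timeit.timeit(lambda: f(c, v), number=10) / 10)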
In any case, maybe it is useful to Sérgio or others.

Allan

On 02/13/2016 12:11 PM, Allan Haldane wrote:

On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
The use cases I recently started to target for similar things are 1 million or more rows and 10,000 uniques in the labels. The second version should be faster for a large number of uniques, I guess. Overall, numpy is falling far behind pandas in terms of simple groupby operations. bincount and histogram (IIRC) worked for some cases but are rather limited. reduceat looks nice for the cases where it applies. In contrast to the full-sized labels in the original post, I only know of applications where the labels are 1-D, corresponding to rows or columns.

Josef
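As a concrete instance of the bincount case mentioned above (a sketch; the arrays here are illustrative):

import numpy as np

label = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])   # integer labels, assumed
data = np.arange(9.0)

# grouped sums: one bin per integer label
sums = np.bincount(label, weights=data)    # array([ 9., 12., 15.])
counts = np.bincount(label)                # group sizes
means = sums / counts                      # grouped means, for free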

In [10]: pd.options.display.max_rows = 10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0, 32, size=100000)

In [15]: v = np.arange(100000)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
        c      v
0      15      0
1      19      1
2       6      2
3      21      3
4      12      4
...    ..    ...
99995   7  99995
99996   2  99996
99997  27  99997
99998  28  99998
99999   7  99999

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
       v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
               v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..           ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]

On Sat, Feb 13, 2016 at 1:29 PM, <josef.pktd@gmail.com> wrote:

These operations get slower as the number of groups increases, but with a faster function (e.g. the standard ones, which are cythonized), the constant on the increase is pretty low.

In [23]: c = np.random.randint(0, 10000, size=100000)

In [24]: df = DataFrame({'v' : v, 'c' : c})

In [25]: %timeit df.groupby('c').count()
100 loops, best of 3: 3.18 ms per loop

In [26]: len(df.groupby('c').count())
Out[26]: 10000

In [27]: df.groupby('c').count()
Out[27]:
       v
c
0      9
1     11
2      7
3      8
4     16
...   ..
9995  11
9996  13
9997  13
9998   7
9999  10

[10000 rows x 1 columns]

On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback <jeffreback@gmail.com> wrote:

On Sat, Feb 13, 2016 at 1:42 PM, Jeff Reback <jeffreback@gmail.com> wrote:
One other difference across use cases is whether this is a single operation, or whether we want to optimize the data format for a large number of different calculations. (We have both cases in statsmodels.) In the latter case it's worth spending some extra computational effort on rearranging the data to be either sorted or in lists of arrays (I guess, without having done any timings).

Josef
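A sketch of that idea: pay for the sort once, then reuse the grouping for any number of reductions (the helper names here are made up for illustration, not an existing API):

import numpy as np

def make_groups(c):
    # one-time cost: sort the labels and record group boundaries
    perm = c.argsort()
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return perm, div

def apply_grouped(func, v, perm, div):
    # reuse the precomputed grouping for any per-group reduction
    return [func(g) for g in np.split(v[perm], div)]

np.random.seed(1234)
c = np.random.randint(0, 10000, size=100000)
v = np.arange(100000.0)

perm, div = make_groups(c)                    # sort once
sums = apply_grouped(np.sum, v, perm, div)    # then reduce many times
means = apply_grouped(np.mean, v, perm, div)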

Impressive! Possibly there's still a case for including a 'groupby' function in numpy itself, since it's a generally useful operation, but I do see less of a need given the nice pandas functionality. At least, next time someone asks a stackoverflow question like the ones below, someone should tell them to use pandas! (Copied from my gist for future list reference.)

http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-v...
http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-o...
http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smal...
http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-t...

Allan

On 02/13/2016 01:39 PM, Jeff Reback wrote:

I believe this is basically a groupby, which is one of pandas's core competencies... even if numpy were to add some utilities for this kind of thing, I doubt we'd do as well as they do, so you might check whether pandas works for you first :-)

On Feb 12, 2016 6:40 AM, "Sérgio" <filaboia@gmail.com> wrote:

participants (8):
- Allan Haldane
- Benjamin Root
- Jeff Reback
- josef.pktd@gmail.com
- Lluís Vilanova
- Nathaniel Smith
- Paul Hobson
- Sérgio