On Tue, Aug 12, 2014 at 12:17 PM, Eelco Hoogendoorn <hoogendoorn.eelco@gmail.com> wrote:

Thanks. Prompted by that Stack Overflow question, and by similar problems I had to deal with myself, I started working on a much more general extension to numpy's functionality in this space. As you noted, things get a little panda-y, but I think a lot of pandas' functionality could or should be part of the numpy core, a robust set of grouping operations in particular.

FYI I wrote some table grouping operations (join, hstack, vstack) for numpy some time ago, available here:

https://github.com/astropy/astropy/blob/v0.4.x/astropy/table/np_utils.py

These are part of the astropy project but this module has no actual astropy dependencies apart from a local backport of OrderedDict for Python < 2.7.

Cheers,

Tom

See the pastebin here:
http://pastebin.com/c5WLWPbp

I've posted about it on this list before, but without apparent interest, and I haven't gotten around to bringing it up to professional standards yet either. But there is a lot more that could be done in this direction.

Note that the count functionality in the Stack Overflow answer is relatively indirect and inefficient, since it goes through the inverse index. The code in the pastebin uses a much more efficient method.
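For readers without access to the pastebin, here is a minimal sketch of one way to count unique values without building an inverse index: sort once, mark where each run of equal values starts, and take the differences between the run starts. This is only an illustration of the general idea, not necessarily what the pastebin code does; the function name `sorted_counts` is made up for this example.

```python
import numpy as np

def sorted_counts(values):
    """Hypothetical helper: unique values and their counts via a single sort."""
    a = np.sort(np.asarray(values).ravel())
    if a.size == 0:
        return a, np.zeros(0, dtype=np.intp)
    # True at the first element of each run of equal values in the sorted array
    starts = np.concatenate(([True], a[1:] != a[:-1]))
    idx = np.flatnonzero(starts)
    # The length of each run is the gap between consecutive run starts
    counts = np.diff(np.append(idx, a.size))
    return a[starts], counts

vals, counts = sorted_counts([3, 4, 3, 3, 3, 4, 5, 5, 5])
```

With this input, `vals` holds the sorted unique values and `counts` the number of times each one occurs.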

On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:

On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser <warren.weckesser@gmail.com> wrote:

I created a pull request (https://github.com/numpy/numpy/pull/4958) that defines the function `count_unique`. `count_unique` generates a contingency table from a collection of sequences. For example,

In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]

In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

In [9]: (xvals, yvals), counts = count_unique(x, y)

In [10]: xvals
Out[10]: array([1, 2])

In [11]: yvals
Out[11]: array([3, 4, 5])

In [12]: counts
Out[12]:
array([[3, 1, 0],
       [1, 1, 3]])

It can be interpreted as a multi-argument generalization of `np.unique(x, return_counts=True)`.
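To make that generalization concrete, here is a rough sketch of how such a table could be built from existing numpy functions. It is not the code in the pull request; the name `count_unique_sketch` is hypothetical, and the approach (one `np.unique(..., return_inverse=True)` per input, then `np.ravel_multi_index` and `np.bincount`) is just one plausible way to do it.

```python
import numpy as np

def count_unique_sketch(*seqs):
    """Hypothetical stand-in for the proposed count_unique (not the PR's code)."""
    uniques = []
    inverses = []
    for seq in seqs:
        # inv[i] is the position of seq[i] within the sorted unique values
        vals, inv = np.unique(np.asarray(seq), return_inverse=True)
        uniques.append(vals)
        inverses.append(inv.ravel())
    shape = tuple(len(v) for v in uniques)
    # Map each tuple of per-sequence indices to one flat cell index
    flat = np.ravel_multi_index(tuple(inverses), shape)
    # Count occurrences per cell, including cells that never occur
    counts = np.bincount(flat, minlength=int(np.prod(shape))).reshape(shape)
    return tuple(uniques), counts

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
(xvals, yvals), counts = count_unique_sketch(x, y)
```

For the `x` and `y` above, this reproduces the `xvals`, `yvals`, and `counts` shown in the session. The sketch assumes all input sequences have the same length.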

It overlaps with Pandas' `crosstab`, but I think this is a pretty fundamental counting operation that fits in numpy.

Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html) and R's `table` perform the same calculation (with a few more bells and whistles).

For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):

In [28]: import pandas as pd

In [29]: xs = pd.Series(x)

In [30]: ys = pd.Series(y)

In [31]: pd.crosstab(xs, ys)
Out[31]:
col_0  3  4  5
row_0
1      3  1  0
2      1  1  3

And here is R's `table`:

> x <- c(1,1,1,1,2,2,2,2,2)
> y <- c(3,4,3,3,3,4,5,5,5)
> table(x, y)
   y
x   3 4 5
  1 3 1 0
  2 1 1 3

Is there any interest in adding this (or some variation of it) to numpy?

Warren

While searching Stack Overflow in the numpy tag for "count unique", I just discovered that I basically reinvented Eelco Hoogendoorn's code in his answer to http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-unique-values-in-an-array. Nice one, Eelco!

Warren

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
