Mailman 3 New function `count_unique` to generate contingency tables. - NumPy-Discussion

New function `count_unique` to generate contingency tables.

Warren Weckesser

Aug. 12, 2014

3:35 p.m.

I created a pull request (https://github.com/numpy/numpy/pull/4958) that defines the function `count_unique`. `count_unique` generates a contingency table from a collection of sequences. For example,

    In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]

    In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

    In [9]: (xvals, yvals), counts = count_unique(x, y)

    In [10]: xvals
    Out[10]: array([1, 2])

    In [11]: yvals
    Out[11]: array([3, 4, 5])

    In [12]: counts
    Out[12]:
    array([[3, 1, 0],
           [1, 1, 3]])

It can be interpreted as a multi-argument generalization of `np.unique(x, return_counts=True)`.

It overlaps with Pandas' `crosstab`, but I think this is a pretty fundamental counting operation that fits in numpy.

Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html) and R's `table` perform the same calculation (with a few more bells and whistles).

For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):

    In [28]: import pandas as pd

    In [29]: xs = pd.Series(x)

    In [30]: ys = pd.Series(y)

    In [31]: pd.crosstab(xs, ys)
    Out[31]:
    col_0  3  4  5
    row_0
    1      3  1  0
    2      1  1  3

And here is R's `table`:

    x <- c(1,1,1,1,2,2,2,2,2)
    y <- c(3,4,3,3,3,4,5,5,5)
    table(x, y)
       y
    x   3 4 5
      1 3 1 0
      2 1 1 3

Is there any interest in adding this (or some variation of it) to numpy?

Warren
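As a rough illustration of the operation described in this message, here is a minimal sketch of such a function built on `np.unique(..., return_inverse=True)` and `np.add.at`. It is an assumption about how one might write it, not the code in the pull request, and the name `count_unique_sketch` is made up for this example.

```python
import numpy as np


def count_unique_sketch(*sequences):
    """Hypothetical helper: contingency table of the given 1-D sequences.

    Returns (tuple of unique-value arrays, N-d array of counts).
    """
    uniques = []
    inverses = []
    for seq in sequences:
        # inv[i] is the position of seq[i] within the sorted unique values
        vals, inv = np.unique(np.asarray(seq), return_inverse=True)
        uniques.append(vals)
        inverses.append(inv.ravel())
    counts = np.zeros([len(u) for u in uniques], dtype=np.intp)
    # Unbuffered add: repeated index combinations are each counted once.
    np.add.at(counts, tuple(inverses), 1)
    return tuple(uniques), counts


x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
(xvals, yvals), counts = count_unique_sketch(x, y)
print(xvals, yvals)  # [1 2] [3 4 5]
print(counts)        # rows follow xvals, columns follow yvals
```

With a single argument the result should agree with `np.unique(x, return_counts=True)`, which is the sense in which the proposal generalizes that call.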

Warren Weckesser

August 2014

3:57 p.m.

On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...

While searching StackOverflow in the numpy tag for "count unique", I just discovered that I basically reinvented Eelco Hoogendoorn's code in his answer to http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-uniqu.... Nice one, Eelco! Warren

Eelco Hoogendoorn

4:17 p.m.

Thanks. Prompted by that stackoverflow question, and similar problems I had to deal with myself, I started working on a much more general extension to numpy's functionality in this space. Like you noted, things get a little panda-y, but I think there is a lot of pandas' functionality that could or should be part of the numpy core, a robust set of grouping operations in particular.

See pastebin here: http://pastebin.com/c5WLWPbp

I've posted about it on this list before, but without apparent interest; and I haven't gotten around to getting this up to professional standards yet either. But there is a lot more that could be done in this direction.

Note that the count functionality in the stackoverflow answer is relatively indirect and inefficient, using the inverse_index and such. A much more efficient method is obtained by the code used here.

On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...
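To make the efficiency remark concrete, here are two ways to count occurrences of each unique key. The first goes through the inverse index; the second sorts once and measures run lengths. The second is only a guess at the kind of approach the remark has in mind (the pastebin code is not reproduced in this thread), and both function names are hypothetical.

```python
import numpy as np


def counts_via_inverse(keys):
    """Count by mapping every element to its unique index, then binning."""
    uniq, inv = np.unique(np.asarray(keys), return_inverse=True)
    return uniq, np.bincount(inv.ravel(), minlength=len(uniq))


def counts_via_sorted_runs(keys):
    """Count by sorting once and measuring the length of each run."""
    s = np.sort(np.asarray(keys).ravel())
    if s.size == 0:
        return s, np.zeros(0, dtype=np.intp)
    # True wherever a new run of equal values starts
    is_start = np.concatenate(([True], s[1:] != s[:-1]))
    starts = np.flatnonzero(is_start)
    # Run length = distance to the next start (or to the end of the array)
    counts = np.diff(np.append(starts, s.size))
    return s[starts], counts


y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
print(counts_via_inverse(y))
print(counts_via_sorted_runs(y))
```

Both return the same unique values and counts; the run-length version avoids building the inverse array when only the counts are needed.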

Joe Kington

4:33 p.m.

On Tue, Aug 12, 2014 at 11:17 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

On a side note, this is related to a pull request of mine from a while back: https://github.com/numpy/numpy/pull/3584

There was a lot of disagreement on the mailing list about what to call a "unique slices along a given axis" function, so I wound up closing the pull request pending more discussion.

At any rate, I think it's a useful thing to have in "base" numpy.

Eelco Hoogendoorn

4:51 p.m.

Ah yes, that's also an issue I was trying to deal with. The semantics I prefer in these types of operators is (as a default) to have every array be treated as a sequence of keys, so if calling unique(arr_2d), you'd get unique rows, unless you pass axis=None, in which case the array is flattened.

I also agree that the extension you propose here is useful; but ideally, with a little more discussion on these subjects we can converge on an even more comprehensive overhaul.

On Tue, Aug 12, 2014 at 6:33 PM, Joe Kington <joferkington@gmail.com> wrote:

...
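The two behaviours contrasted in this message can be tried with the `axis` keyword that `np.unique` gained in a NumPy release after this thread. Note that in released NumPy the default is the flattened behaviour (`axis=None`), so unique rows have to be requested explicitly with `axis=0`; the array below is made up for illustration.

```python
import numpy as np

arr_2d = np.array([[1, 2],
                   [3, 4],
                   [1, 2],
                   [3, 5]])

# Treat each row as one key: unique rows, sorted lexicographically.
rows, row_counts = np.unique(arr_2d, axis=0, return_counts=True)
print(rows)        # [[1 2] [3 4] [3 5]]
print(row_counts)  # [2 1 1]

# axis=None flattens the array before looking for unique values.
flat, flat_counts = np.unique(arr_2d, axis=None, return_counts=True)
print(flat)         # [1 2 3 4 5]
print(flat_counts)  # [2 2 2 1 1]
```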

Warren Weckesser

8:57 p.m.

On Tue, Aug 12, 2014 at 12:51 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

Update: I renamed the function to `table` in the pull request: https://github.com/numpy/numpy/pull/4958 Warren

Benjamin Root

9:15 p.m.

The ever-wonderful pylab mode in matplotlib has a table function for plotting a table of text in a plot. If I remember correctly, what would happen is that matplotlib's table() function will simply obliterate numpy's table function. This isn't a show-stopper, I just wanted to point that out.

Personally, while I wasn't a particular fan of "count_unique" because I wouldn't necessarily think of it when needing a contingency table, I do like that it is verb-ish. "table()", in this sense, is not a verb. That said, I am perfectly fine with it if you are fine with the name collision in pylab mode.

On Wed, Aug 13, 2014 at 4:57 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Warren Weckesser

9:25 p.m.

On Wed, Aug 13, 2014 at 5:15 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

Thanks for pointing that out. I only changed it to have something that sounded more table-ish, like the Pandas, R and Matlab functions. I won't update it right now, but if there is interest in putting it into numpy, I'll rename it to avoid the pylab conflict. Anything along the lines of `crosstab`, `xtable`, etc., would be fine with me. Warren

Eelco Hoogendoorn

10:17 p.m.

It's pretty easy to implement this table functionality and more on top of the code I linked above. I still think such a comprehensive overhaul of arraysetops is worth discussing.

    import numpy as np
    import grouping  # the code linked above

    x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
    y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
    z = np.random.randint(0, 2, (9, 2))

    def table(*keys):
        """
        desired table implementation, building on the index object
        cleaner, and more functionality
        performance should be the same
        """
        indices = [grouping.as_index(k, axis=0) for k in keys]
        uniques = [i.unique for i in indices]
        # a tuple, so that np.add.at indexes one axis per key
        inverses = tuple(i.inverse for i in indices)
        shape = [i.groups for i in indices]
        t = np.zeros(shape, int)  # the np.int alias was removed from recent NumPy
        np.add.at(t, inverses, 1)
        return tuple(uniques), t

    # here is how to use
    print(table(x, y))

    # but we can use fancy keys as well; here a composite key and a row-key
    print(table((x, y), z))

    # this effectively creates a sparse matrix equivalent of your desired table
    print(grouping.count((x, y)))

On Wed, Aug 13, 2014 at 11:25 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...
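The `grouping` module is not included in this thread, so the `table` function posted in this message cannot be run as it stands. Below is a self-contained approximation using only NumPy. The helper `key_index` is a hypothetical stand-in for `grouping.as_index` and may well differ from it in behavior: it assumes that a tuple means a composite key and that a 2-D array means one key per row. The array `z` is fixed rather than random so the result is reproducible.

```python
import numpy as np


def key_index(key):
    """Hypothetical stand-in for grouping.as_index.

    Returns (unique keys, inverse index) for a 1-D key, a 2-D row key,
    or a tuple of equal-length sequences (composite key).
    """
    if isinstance(key, tuple):
        key = np.column_stack(key)  # composite key: one row per item
    key = np.asarray(key)
    axis = 0 if key.ndim > 1 else None
    uniq, inv = np.unique(key, axis=axis, return_inverse=True)
    return uniq, inv.ravel()


def table(*keys):
    """Contingency table over any mix of the key kinds above."""
    uniques, inverses = zip(*(key_index(k) for k in keys))
    t = np.zeros([len(u) for u in uniques], dtype=np.intp)
    np.add.at(t, inverses, 1)  # inverses is a tuple: one index array per axis
    return uniques, t


x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
z = np.array([[0, 1], [0, 1], [1, 0], [0, 1], [1, 0],
              [1, 0], [0, 1], [0, 1], [1, 0]])

print(table(x, y)[1])       # 2 x 3 table of counts
print(table((x, y), z)[1])  # composite (x, y) key against the rows of z
```

Building the dense array is convenient for small tables, but the number of cells grows with the product of the unique-key counts, which is presumably why the original message also mentions a sparse equivalent.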

Warren Weckesser

January 2015

6:48 p.m.

On Wed, Aug 13, 2014 at 6:17 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

Hey all,

I'm reviving this thread about the proposed `table` enhancement in https://github.com/numpy/numpy/pull/4958, because Chuck has poked me (via the pull request) about it, so I'm poking the mailing list. Ignoring the issue of the name for the moment, is there any opposition to adding the proposed `table` function to numpy? I don't think it would preclude adding more powerful tools later, but that's not something I have time to work on at the moment.

If the only issue is the name, I'm open to any suggestions. I started with `count_unique`, and changed it to `table`, but Benjamin pointed out the potential conflict of `table` with a matplotlib function.

Warren

...

Warren Weckesser

7 p.m.

On Sun, Jan 25, 2015 at 1:48 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...

Looks like the original email in the thread is not part of the quoted (and somewhat disordered) emails. Here's my original email from last August: http://mail.scipy.org/pipermail/numpy-discussion/2014-August/070941.html Warren

...

Aldcroft, Thomas

7:32 p.m.

On Tue, Aug 12, 2014 at 12:17 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

FYI I wrote some table grouping operations (join, hstack, vstack) for numpy some time ago, available here: https://github.com/astropy/astropy/blob/v0.4.x/astropy/table/np_utils.py These are part of the astropy project but this module has no actual astropy dependencies apart from a local backport of OrderedDict for Python < 2.7. Cheers, Tom


Warren Weckesser

August 2014

3:57 p.m.

On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...

Eelco Hoogendoorn

4:17 p.m.

New subject: New function `count_unique` to generate contingency tables.

...

Joe Kington

4:33 p.m.

New subject: New function `count_unique` to generate contingency tables.

On Tue, Aug 12, 2014 at 11:17 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

Eelco Hoogendoorn

4:51 p.m.

New subject: New function `count_unique` to generate contingency tables.

...

Warren Weckesser

8:57 p.m.

New subject: New function `count_unique` to generate contingency tables.

On Tue, Aug 12, 2014 at 12:51 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

Update: I renamed the function to `table` in the pull request: https://github.com/numpy/numpy/pull/4958 Warren

Benjamin Root

9:15 p.m.

New subject: New function `count_unique` to generate contingency tables.

...

On Tue, Aug 12, 2014 at 12:51 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...
ah yes, that's also an issue I was trying to deal with. the semantics I prefer in these type of operators, is (as a default), to have every array be treated as a sequence of keys, so if calling unique(arr_2d), youd get unique rows, unless you pass axis=None, in which case the array is flattened.

I also agree that the extension you propose here is useful; but ideally, with a little more discussion on these subjects we can converge on an even more comprehensive overhaul

On Tue, Aug 12, 2014 at 6:33 PM, Joe Kington <joferkington@gmail.com> wrote:

...
On Tue, Aug 12, 2014 at 11:17 AM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...
Thanks. Prompted by that stackoverflow question, and similar problems I had to deal with myself, I started working on a much more general extension to numpy's functionality in this space. Like you noted, things get a little panda-y, but I think there is a lot of panda's functionality that could or should be part of the numpy core, a robust set of grouping operations in particular.

see pastebin here: http://pastebin.com/c5WLWPbp

On a side note, this is related to a pull request of mine from awhile back: https://github.com/numpy/numpy/pull/3584

There was a lot of disagreement on the mailing list about what to call a "unique slices along a given axis" function, so I wound up closing the pull request pending more discussion.

At any rate, I think it's a useful thing to have in "base" numpy.

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Update: I renamed the function to `table` in the pull request: https://github.com/numpy/numpy/pull/4958

Warren

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Warren Weckesser

August 2014

9:25 p.m.

New subject: New function `count_unique` to generate contingency tables.

On Wed, Aug 13, 2014 at 5:15 PM, Benjamin Root <ben.root@ou.edu> wrote:

...

On Wed, Aug 13, 2014 at 4:57 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...
On Tue, Aug 12, 2014 at 12:51 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...
Ah yes, that's also an issue I was trying to deal with. The semantics I prefer in these types of operators is, by default, to have every array treated as a sequence of keys, so if you call unique(arr_2d), you'd get the unique rows, unless you pass axis=None, in which case the array is flattened.

I also agree that the extension you propose here is useful; but ideally, with a little more discussion on these subjects, we can converge on an even more comprehensive overhaul.
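For readers following along, the row-versus-flattened distinction described above can be sketched with the `axis` argument that `np.unique` accepts in later NumPy releases (1.13 and newer). Note one difference from the preference stated in this message: in released NumPy the default is `axis=None` (flatten), and unique rows must be requested explicitly with `axis=0`. The array `arr_2d` is just an illustrative input.

```python
import numpy as np

arr_2d = np.array([[1, 2],
                   [3, 4],
                   [1, 2]])

# Treat each row as one key: the duplicate row [1, 2] collapses to one entry.
unique_rows = np.unique(arr_2d, axis=0)

# Flatten the array first, then take the unique scalar values.
unique_scalars = np.unique(arr_2d, axis=None)

print(unique_rows)
print(unique_scalars)
```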


Eelco Hoogendoorn

10:17 p.m.

New subject: New function `count_unique` to generate contingency tables.

...

On Wed, Aug 13, 2014 at 5:15 PM, Benjamin Root <ben.root@ou.edu> wrote:

...
The ever-wonderful pylab mode in matplotlib has a table function for plotting a table of text in a plot. If I remember correctly, what would happen is that matplotlib's table() function will simply obliterate numpy's table function. This isn't a show-stopper, I just wanted to point that out.

Personally, while I wasn't a particular fan of "count_unique" because I wouldn't necessarily think of it when needing a contingency table, I do like that it is verb-ish. "table()", in this sense, is not a verb. That said, I am perfectly fine with it if you are fine with the name collision in pylab mode.

Thanks for pointing that out. I only changed it to have something that sounded more table-ish, like the Pandas, R and Matlab functions. I won't update it right now, but if there is interest in putting it into numpy, I'll rename it to avoid the pylab conflict. Anything along the lines of `crosstab`, `xtable`, etc., would be fine with me.

Warren


Warren Weckesser

January 2015

6:48 p.m.

New subject: New function `count_unique` to generate contingency tables.

On Wed, Aug 13, 2014 at 6:17 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

It's pretty easy to implement this table functionality and more on top of the code I linked above. I still think such a comprehensive overhaul of arraysetops is worth discussing.

import numpy as np
import grouping

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
z = np.random.randint(0, 2, (9, 2))

def table(*keys):
    """
    desired table implementation, building on the index object
    cleaner, and more functionality
    performance should be the same
    """
    indices = [grouping.as_index(k, axis=0) for k in keys]
    uniques = [i.unique for i in indices]
    inverses = [i.inverse for i in indices]
    shape = [i.groups for i in indices]
    t = np.zeros(shape, int)
    # index with a tuple of arrays, one per table axis
    np.add.at(t, tuple(inverses), 1)
    return tuple(uniques), t

# here is how to use
print(table(x, y))

# but we can use fancy keys as well; here a composite key and a row-key
print(table((x, y), z))

# this effectively creates a sparse matrix equivalent of your desired table
print(grouping.count((x, y)))
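The code above depends on the `grouping` module from the pastebin, which is not reproduced in this thread. As a rough, self-contained sketch of the same counting idea, the version below swaps in `np.unique(..., return_inverse=True)` for the index objects. The name `table_sketch` is hypothetical, and this variant only handles plain 1-D keys, not the composite or row keys mentioned above.

```python
import numpy as np

def table_sketch(*keys):
    """Contingency table for one or more equal-length 1-D key sequences.

    Hypothetical stand-in for the `table` function in this message; it uses
    np.unique instead of the `grouping` index objects.
    """
    uniques, inverses = [], []
    for k in keys:
        # inv[i] is the position of k[i] within the sorted unique values
        u, inv = np.unique(np.asarray(k), return_inverse=True)
        uniques.append(u)
        inverses.append(inv.ravel())
    shape = tuple(len(u) for u in uniques)
    counts = np.zeros(shape, dtype=np.intp)
    # Unbuffered add, so repeated index tuples each contribute 1.
    np.add.at(counts, tuple(inverses), 1)
    return tuple(uniques), counts

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
(xvals, yvals), counts = table_sketch(x, y)
print(counts)
```

With the `x` and `y` from the original post, this reproduces the 2-by-3 table shown there, `[[3, 1, 0], [1, 1, 3]]`.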


...

Warren Weckesser

7 p.m.

New subject: New function `count_unique` to generate contingency tables.

On Sun, Jan 25, 2015 at 1:48 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...


Hey all,

I'm reviving this thread about the proposed `table` enhancement in https://github.com/numpy/numpy/pull/4958, because Chuck has poked me (via the pull request) about it, so I'm poking the mailing list. Ignoring the issue of the name for the moment, is there any opposition to adding the proposed `table` function to numpy? I don't think it would preclude adding more powerful tools later, but that's not something I have time to work on at the moment.

If the only issue is the name, I'm open to any suggestions. I started with `count_unique`, and changed it to `table`, but Benjamin pointed out the potential conflict of `table` with a matplotlib function.

Warren

...

Aldcroft, Thomas

7:32 p.m.

New subject: New function `count_unique` to generate contingency tables.

On Tue, Aug 12, 2014 at 12:17 PM, Eelco Hoogendoorn < hoogendoorn.eelco@gmail.com> wrote:

...

see pastebin here: http://pastebin.com/c5WLWPbp

I've posted about it on this list before, but without apparent interest; and I haven't gotten around to getting this up to professional standards yet either. But there is a lot more that could be done in this direction.

Note that the count functionality in the stackoverflow answer is relatively indirect and inefficient, using the inverse_index and such. A much more efficient method is obtained by the code used here.
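The pastebin code is not shown in this thread, so the following is only a guess at the kind of contrast being drawn. It compares an inverse-index count (assumed here to be `np.unique(..., return_inverse=True)` followed by `np.bincount`) with a count taken directly from the run boundaries of the sorted keys. Both function names are hypothetical, and neither is claimed to match the StackOverflow answer or the pastebin exactly.

```python
import numpy as np

def count_via_inverse(keys):
    """Count occurrences by building an inverse index, then binning it."""
    values, inverse = np.unique(np.asarray(keys), return_inverse=True)
    return values, np.bincount(inverse.ravel(), minlength=values.size)

def count_sorted(keys):
    """Count occurrences from the run boundaries of the sorted keys."""
    sorted_keys = np.sort(np.asarray(keys))
    if sorted_keys.size == 0:
        return sorted_keys, np.array([], dtype=np.intp)
    # True wherever a new run of equal values starts.
    is_start = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
    starts = np.flatnonzero(is_start)
    # Run lengths are the gaps between consecutive start positions.
    counts = np.diff(np.concatenate((starts, [sorted_keys.size])))
    return sorted_keys[starts], counts

y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
print(count_via_inverse(y))
print(count_sorted(y))
```

The second version needs only one sort and no per-element inverse array, which is one plausible reading of "more efficient" here.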

On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...
On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser < warren.weckesser@gmail.com> wrote:

...
I created a pull request (https://github.com/numpy/numpy/pull/4958) that defines the function `count_unique`. `count_unique` generates a contingency table from a collection of sequences. For example,

In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]

In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]

In [9]: (xvals, yvals), counts = count_unique(x, y)

In [10]: xvals
Out[10]: array([1, 2])

In [11]: yvals
Out[11]: array([3, 4, 5])

In [12]: counts
Out[12]:
array([[3, 1, 0],
       [1, 1, 3]])

It can be interpreted as a multi-argument generalization of `np.unique(x, return_counts=True)`.
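As a small illustration of that relationship, using only the numbers from the session above (the `counts` array is copied from `Out[12]`, not recomputed): summing the table over the `y` axis should recover the single-argument counts that `np.unique(x, return_counts=True)` returns.

```python
import numpy as np

x = [1, 1, 1, 1, 2, 2, 2, 2, 2]

# Table copied from Out[12] above: rows are x values, columns are y values.
counts = np.array([[3, 1, 0],
                   [1, 1, 3]])

# Single-argument case: unique values of x and how often each occurs.
xvals, xcounts = np.unique(x, return_counts=True)

# Summing over the y axis gives the marginal counts for x.
row_totals = counts.sum(axis=1)

print(xvals, xcounts, row_totals)
```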

It overlaps with Pandas' `crosstab`, but I think this is a pretty fundamental counting operation that fits in numpy.

Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html) and R's `table` perform the same calculation (with a few more bells and whistles).

For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):

In [28]: import pandas as pd

In [29]: xs = pd.Series(x)

In [30]: ys = pd.Series(y)

In [31]: pd.crosstab(xs, ys)
Out[31]:
col_0  3  4  5
row_0
1      3  1  0
2      1  1  3

And here is R's `table`:

...
x <- c(1,1,1,1,2,2,2,2,2)
y <- c(3,4,3,3,3,4,5,5,5)
table(x, y)
   y
x   3 4 5
  1 3 1 0
  2 1 1 3

Is there any interest in adding this (or some variation of it) to numpy?

Warren

While searching StackOverflow in the numpy tag for "count unique", I just discovered that I basically reinvented Eelco Hoogendoorn's code in his answer to http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-uniqu.... Nice one, Eelco!

Warren



11 comments

5 participants

participants (5)

Aldcroft, Thomas
Benjamin Root
Eelco Hoogendoorn
Joe Kington
Warren Weckesser
