Adding an axis argument to numpy.unique
Hi everyone,

I've recently put together a pull request that adds an `axis` kwarg to `numpy.unique` so that `unique` can easily be used to find unique rows/columns/sub-arrays/etc. of a larger array.

https://github.com/numpy/numpy/pull/3584

Currently, this works as a wrapper around `unique`. If `axis` is specified, it reshapes the input to a 2D contiguous array, views each row as a single item, then passes it on to `unique`. For int and string dtypes, each row is viewed as a void dtype, and therefore bitwise equality is used for comparisons. For all other dtypes, each row is viewed as a structured array.

The current implementation has two main drawbacks:

1. For anything other than ints and strings, it's relatively slow.
2. It doesn't work with object arrays of any sort.

I'd appreciate any thoughts/feedback folks might have on both the general idea and this specific implementation. I think it's a worthwhile addition, but I'm biased.

Thanks!
-Joe
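For readers who want to see the void-view step concretely, here is a rough sketch for a 2D integer array. The helper name `unique_rows_void` is made up for illustration, and this is not the code from the PR, only an approximation of the approach described above.

```python
import numpy as np

def unique_rows_void(a):
    """Hypothetical helper: unique rows of a 2D int array via a void view."""
    a = np.ascontiguousarray(a)
    if a.ndim != 2:
        raise ValueError("expected a 2D array")
    # View each row as a single opaque item spanning the whole row's bytes.
    row_dtype = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    rows = a.view(row_dtype).ravel()
    # unique() now compares whole rows bitwise.
    uniq = np.unique(rows)
    # View the items back as the original dtype and restore the 2D shape.
    return uniq.view(a.dtype).reshape(-1, a.shape[1])

a = np.array([[1, 2], [3, 4], [1, 2]])
out = unique_rows_void(a)  # two distinct rows remain
```

Because the comparison is bitwise, the rows come back ordered by their raw bytes, which need not match numeric ordering.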
On Sun, Aug 18, 2013 at 7:14 PM, Joe Kington <joferkington@gmail.com> wrote:
> I'd appreciate any thoughts/feedback folks might have on both the general idea and this specific implementation. I think it's a worthwhile addition, but I'm biased.
Just a general comment: I have been missing a `unique_rows` or something like that, which seems to be the target of this change.

However, my first interpretation of an axis argument in unique would be that it treats each column (or whatever lies along axis) separately, analogously to max, argmax and similar.

On second thought: unique with axis working on each column separately wouldn't create a nice return array, because it won't be rectangular (in general).

Josef
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
...<snip>
> However, my first interpretation of an axis argument in unique would be that it treats each column (or whatever along axis) separately. Analogously to max, argmax and similar.
Good point! That's certainly a potential source of confusion. However, I can't seem to come up with a better name for the kwarg. Matlab's "unique" function has a "rows" option, which is probably a more intuitive name, but doesn't imply the expansion to N-dimensions. "axis" is still fairly idiomatic, despite the confusion over "unique rows/columns/etc" vs "unique items within each row/column/etc". Any thoughts on a better name for the argument?
> On second thought: unique with axis working on each column separately wouldn't create a nice return array, because it won't be rectangular (in general)
>
> Josef
Yeah, and "unique items within each row/column/etc" would be best implemented as a one-line list comprehension for that reason, rather than an addition to unique, i.m.o. Thanks for the feedback! -Joe
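As an illustration of the one-liner mentioned above, a minimal sketch (the array `a` is made-up example data):

```python
import numpy as np

a = np.array([[1, 1, 2],
              [3, 4, 5],
              [6, 6, 6]])

# Unique items within each row. The results have different lengths,
# so they are kept in a list rather than a rectangular array.
per_row = [np.unique(row) for row in a]

# The same idea per column: iterate over the transpose.
per_col = [np.unique(col) for col in a.T]
```

Here `per_row` holds arrays of length 2, 3 and 1, which shows why this case does not fit a single rectangular return value.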
On Tue, Aug 20, 2013 at 2:39 AM, Joe Kington <joferkington@gmail.com> wrote:
> That's certainly a potential source of confusion. However, I can't seem to come up with a better name for the kwarg. Matlab's "unique" function has a "rows" option, which is probably a more intuitive name, but doesn't imply the expansion to N-dimensions.
To me, `unique_rows` sounds perfect. To go along columns: `unique_rows(A.T)`

Stéfan
> To me, `unique_rows` sounds perfect. To go along columns: `unique_rows(A.T)`
>
> Stéfan

Personally, I like this idea as well: a separate `unique_rows` function, which potentially takes an `axis` argument. (Alternatively, `unique_sequences` wouldn't imply a particular axis.)

Of course, the obvious downside to this is namespace pollution. The upside would be discoverability.
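To make the proposal concrete, here is one way a standalone `unique_rows` with an optional `axis` argument might look. This is only a sketch: the function does not exist in NumPy, its signature is an assumption, and it uses a plain lexicographic sort rather than the void-view approach from the PR. It assumes a sortable numeric dtype and non-empty sub-arrays.

```python
import numpy as np

def unique_rows(a, axis=0):
    """Hypothetical sketch: unique sub-arrays taken along `axis`."""
    a = np.asarray(a)
    # Bring the chosen axis to the front and flatten each sub-array to a row.
    moved = np.moveaxis(a, axis, 0)
    flat = moved.reshape(moved.shape[0], -1)
    # lexsort treats the LAST key as primary, so reverse the column order
    # to sort rows by column 0 first, then column 1, and so on.
    order = np.lexsort(flat.T[::-1])
    s = flat[order]
    # Keep a row if it differs from the row before it in sorted order.
    keep = np.ones(len(s), dtype=bool)
    keep[1:] = np.any(s[1:] != s[:-1], axis=1)
    out = s[keep].reshape((-1,) + moved.shape[1:])
    # Put the axis back where it came from.
    return np.moveaxis(out, 0, axis)

A = np.array([[1, 2], [3, 4], [1, 2]])
rows = unique_rows(A)            # unique rows
cols = unique_rows(A.T, axis=1)  # unique columns of A.T
```

With this sketch, `unique_rows(A.T, axis=1)` gives the same result as `unique_rows(A).T`, which is the transpose trick Stéfan suggests.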
On 20 Aug 2013 01:39, "Joe Kington" <joferkington@gmail.com> wrote:
> Any thoughts on a better name for the argument?
I also found this pretty confusing when first looking at the PR.

One option might be to invert the sense of the argument to emphasize that it's treating subarrays as units, so instead of specifying the iteration axis you specify the axes of the subarray. `compare_axis=` or something?

-n
On Tue, Aug 20, 2013 at 5:04 AM, Nathaniel Smith <njs@pobox.com> wrote:
> I also found this pretty confusing when first looking at the PR.
>
> One option might be to invert the sense of the argument to emphasize that it's treating subarrays as units, so instead of specifying the iteration axis you specify the axes of the subarray. compare_axis= or something?
You would need `compare_axes` (plural for ndim>2) and would have to specify all but one axis, AFAICS.

Josef
On 20 Aug 2013 12:09, <josef.pktd@gmail.com> wrote:
> you would need compare_axes (plural for ndim>2) and have to specify all but one axis, AFAICS.

Well, it makes sense to specify any arbitrary subset of axes, whether or not that's currently implemented.

-n
On Tue, Aug 20, 2013 at 7:34 AM, Nathaniel Smith <njs@pobox.com> wrote:
> Well, it makes sense to specify any arbitrary subset of axes, whether or not that's currently implemented.

Not AFAICS, if you want to return a rectangular array without NaNs/missing values.

Josef
On Tue, Aug 20, 2013 at 7:47 AM, <josef.pktd@gmail.com> wrote:
> not AFAICS, if you want to return a rectangular array without nans/missing values.

And unless you want to `ravel()` the remaining axis, which is also weird (I think).

Josef
On 20 Aug 2013 12:53, <josef.pktd@gmail.com> wrote:
> and unless you want to ravel() the remaining axis, which is also weird (I think).

The default (and until this patch, only) behaviour is to ravel all axes, so it'd be consistent.

-n
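To illustrate the `compare_axes` idea together with the ravel behaviour discussed above, here is a speculative sketch. The function name `unique_subarrays` and the `compare_axes` parameter are hypothetical and were only floated in this thread; nothing here is taken from the PR. The axes named in `compare_axes` define the sub-array that is treated as one unit, and all remaining axes are ravelled into a single leading axis.

```python
import numpy as np

def unique_subarrays(a, compare_axes):
    """Hypothetical sketch: unique sub-arrays spanning `compare_axes`."""
    a = np.asarray(a)
    compare_axes = tuple(int(ax) % a.ndim for ax in np.atleast_1d(compare_axes))
    # Every axis that is not compared gets ravelled into one leading axis.
    other = tuple(ax for ax in range(a.ndim) if ax not in compare_axes)
    sub_shape = tuple(a.shape[ax] for ax in compare_axes)
    items = a.transpose(other + compare_axes).reshape(-1, int(np.prod(sub_shape)))
    # Sort the flattened sub-arrays lexicographically (column 0 is primary).
    order = np.lexsort(items.T[::-1])
    s = items[order]
    # Keep each item that differs from its predecessor in sorted order.
    keep = np.ones(len(s), dtype=bool)
    keep[1:] = np.any(s[1:] != s[:-1], axis=1)
    return s[keep].reshape((-1,) + sub_shape)

a = np.array([[[1, 2], [3, 4]],
              [[1, 2], [5, 6]]])

pairs = unique_subarrays(a, 2)     # unique length-2 vectors along the last axis
scalars = unique_subarrays(a, ())  # no compared axes: everything is ravelled
```

With an empty `compare_axes` every element is its own unit and all axes are ravelled, so the sketch reduces to what plain `np.unique(a)` returns, which is the consistency argument made above.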
participants (4)
- Joe Kington
- josef.pktd@gmail.com
- Nathaniel Smith
- Stéfan van der Walt