finding elements that match any in a set
I have a numpy array, records, with named fields including a field named "integer_field". I have an array (or list) of values of interest, and I want to get the indexes where integer_field has any of those values.
Because I can do
indexes = np.where( records.integer_field > 5 )
I thought I could do
indexes = np.where( records.integer_field in values )
But that doesn't work. (As a side question I'm interested in why that doesn't work, when values is a python list.)
How can I get those indexes? I thought perhaps I need a nested np.where, or some other two-step process, but it wasn't clear to me how to do it.
Thanks.
On Fri, May 27, 2011 at 12:48 PM, Michael Katz michaeladamkatz@yahoo.com wrote:
How can I get those indexes? I thought perhaps I need a nested np.where, or some other two-step process, but it wasn't clear to me how to do it.
Check out this recent thread. I think the proposed class does what you want. It's more efficient than in1d, if values is small compared to the length of records.
http://thread.gmane.org/gmane.comp.python.scientific.user/29035/
Skipper
On 5/27/11 9:48 AM, Michael Katz wrote:
I thought I could do
indexes = np.where( records.integer_field in values )
But that doesn't work. (As a side question I'm interested in why that doesn't work, when values is a python list.)
That doesn't work because the Python list "in" operator doesn't understand arrays, so it is looking to see if the entire array is in the list. Actually, it doesn't even get that far:
In [16]: a in l
ValueError                                Traceback (most recent call last)
/Users/chris.barker/<ipython console> in <module>()
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The ValueError results because it was decided that numpy arrays should not have a boolean value, to avoid confusion: i.e., is an array true whenever it is non-empty (like a list), or when all its elements are true, or????
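A minimal sketch of the failure described above, assuming a small integer array and a plain Python list (both made up for illustration, not taken from the original post). The list's membership test compares the array with each item using ==, gets back an array of booleans, and fails when it tries to reduce that array to a single True/False:

```python
import numpy as np

a = np.array([1, 2, 3])      # illustrative data, not from the original post
values = [2, 3, 7]           # a plain Python list

# list.__contains__ evaluates an equality test between each item and `a`;
# the result is a boolean array, and bool() of a multi-element array
# raises ValueError.
try:
    a in values
    outcome = "no error"
except ValueError:
    outcome = "ValueError"

print(outcome)

# The elementwise comparison itself is fine; only the implicit bool() fails.
elementwise = (a == 2)
print(elementwise)
```

The comparison on its own is well defined; the error comes only from asking for one truth value for the whole array.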
When I read this question, I thought: hmmm, numpy needs something like "in", as the usual way, np.any(), would require a loop in this case. Then I read Skipper's message:
On 5/27/11 9:55 AM, Skipper Seabold wrote:
Check out this recent thread. I think the proposed class does what you want. It's more efficient than in1d, if values is small compared to the length of records.
So that class may be worthwhile, but I think np.in1d is exactly what you are looking for:
indexes = np.in1d( records.integer_field, values )
Funny I'd never noticed that before.
Chris
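One detail worth spelling out: np.in1d returns a boolean mask, not integer indexes, so it is usually combined with np.where or used directly for selection. The sketch below assumes a hypothetical structured array standing in for `records` (the field names and values are invented), and uses np.isin, which newer NumPy releases recommend in place of np.in1d and which gives the same result for 1-D input:

```python
import numpy as np

# Hypothetical structured array standing in for `records`.
records = np.array(
    [(10, 1.5), (3, 2.5), (7, 0.5), (3, 4.0), (8, 9.0)],
    dtype=[("integer_field", "i4"), ("float_field", "f8")],
)
values = [3, 8]

# Boolean mask: True wherever integer_field equals any entry of `values`.
mask = np.isin(records["integer_field"], values)

# Integer positions of the matches, in the same form np.where gives
# for a comparison such as `field > 5`.
indexes = np.where(mask)[0]

# The mask can also be used directly to select the matching records.
matching = records[mask]

print(mask)
print(indexes)
```

Whether the mask or the index array is more convenient depends on what is done next; both select the same records.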
Yes, thanks, np.in1d is what I needed. I didn't know how to find that.
It still seems counterintuitive to me that
indexes = np.where( records.integer_field in values )
does not work whereas
indexes = np.where( records.integer_field > 5 )
does.
In one case numpy is overriding the > operator; it's not checking if an array is greater than 5, but whether each element in the array is greater than 5.
From a naive user's point of view, not knowing much about the difference between > and in from a python point of view, it seems like in would get overridden the same way.
________________________________
From: Christopher Barker Chris.Barker@noaa.gov
To: Discussion of Numerical Python numpy-discussion@scipy.org
Sent: Fri, May 27, 2011 5:48:37 PM
Subject: Re: [Numpy-discussion] finding elements that match any in a set
On Sat, May 28, 2011 at 14:18, Michael Katz michaeladamkatz@yahoo.com wrote:
In one case numpy is overriding the > operator; it's not checking if an array is greater than 5, but whether each element in the array is greater than 5.
From a naive user's point of view, not knowing much about the difference between > and in from a python point of view, it seems like in would get overridden the same way.
The Python operators are turned into special method calls on one of the objects. Most of the special methods that define the mathematical operators come in pairs: __lt__ and __gt__, __add__ and __radd__, etc. So if we have (x > y) then x.__gt__(y) is checked first. If x does not know about the type of y, then y.__lt__(x) is checked. Similarly, for (x + y), x.__add__(y) is checked first, then y.__radd__(x) is checked. For (myarray > 5), myarray.__gt__(5) is checked. numpy arrays do know about ints, so that works.
However, (myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist) so there is nothing that numpy can override to make this operation do anything different from what lists normally do, which is check if the given object is equal to one of the items in the list.
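The dispatch rules described above can be sketched with a toy class. Everything here is hypothetical: `Elementwise` is an invented stand-in for an array type, and `__rcontains__` is deliberately defined to show that Python never looks for it.

```python
class Elementwise:
    """Toy container (hypothetical) that overrides a comparison operator."""

    def __init__(self, values):
        self.values = list(values)
        self.rcontains_called = False

    def __gt__(self, other):
        # Called for `self > other`, and as the reflected fallback for
        # `other < self` when `other` returns NotImplemented.
        return [v > other for v in self.values]

    def __rcontains__(self, container):
        # Not a real special method: Python never calls this.
        self.rcontains_called = True
        return True


e = Elementwise([1, 6, 9])

direct = e > 5        # e.__gt__(5)
reflected = 5 < e     # int.__lt__ gives NotImplemented, so e.__gt__(5) runs

# Only list.__contains__ is consulted; it compares e with each item and
# finds no match, since Elementwise defines no __eq__.
found = e in [1, 6, 9]

print(direct, reflected, found, e.rcontains_called)
```

The comparison operators reach the toy class from either side, while the membership test is answered entirely by the list.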
Thanks for the explanation. It sounds like Python needs __rcontains__.
________________________________
From: Robert Kern robert.kern@gmail.com
To: Discussion of Numerical Python numpy-discussion@scipy.org
Sent: Sat, May 28, 2011 12:30:05 PM
Subject: Re: [Numpy-discussion] finding elements that match any in a set
On Sat, May 28, 2011 at 14:40, Michael Katz michaeladamkatz@yahoo.com wrote:
Thanks for the explanation. It sounds like Python needs __rcontains__.
Not really. For the mathematical operators, there are good ways for the operands to "know" what types they can deal with and which they can't. For mylist.__contains__(x), it should treat all objects exactly the same: check if it equals any item that it contains. There is no way for it to say, "Oh, I don't know how to deal with this type, so I'll pass it over to x.__contains__()".
A function call is the best place for this operation, not syntax.
On 5/28/2011 3:46 PM, Robert Kern wrote:
mylist.__contains__(x), it should treat all objects exactly the same: check if it equals any item that it contains. There is no way for it to say, "Oh, I don't know how to deal with this type, so I'll pass it over to x.__contains__()".
Which makes my comment redundant ...
On 5/28/2011 3:40 PM, Robert wrote:
(myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist) so there is nothing that numpy can override to make this operation do anything different from what lists normally do, which is check if the given object is equal to one of the items in the list.
This seems to me to slightly miscast the problem. How would an __rcontains__ method really fix things? Would the list type check against a table of stuff that it knows how to contain? That seems horrible. And even if possible, NumPy would then have to break the rule that ``in`` tests for equality, because (I believe) the real problem in this case is that np equality testing does not return a bool. From this perspective, what is missing is not __rcontains__ (since the list already knows what to do) but rather __eeq__ for element-by-element comparison (ideally, along with an element-by-element operator such as, say, .==).
In the meantime the OP could use any(all(a==x) for x in lst)
fwiw, Alan Isaac
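A small sketch of the expression suggested above, assuming 1-D arrays of equal length (the data is invented for illustration). It answers a different question from the original one: whether the whole array equals some item of the list. np.array_equal is swapped in as an alternative that also copes with items of a different shape:

```python
import numpy as np

a = np.array([1, 2, 3])
lst = [np.array([4, 5, 6]), np.array([1, 2, 3])]   # illustrative items

# The expression from the message: for 1-D arrays of equal length, the
# builtin all() reduces each elementwise comparison to a single bool.
in_list = any(all(a == x) for x in lst)

# np.array_equal (a swapped-in alternative) compares whole arrays and
# returns False when the shapes differ.
in_list_alt = any(np.array_equal(a, x) for x in lst)
not_in_list = any(np.array_equal(a, x) for x in [np.array([1, 2])])

print(in_list, in_list_alt, not_in_list)
```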
On 5/28/2011 3:40 PM, Robert wrote:
(myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist) so there is nothing that numpy can override to make this operation do anything different from what lists normally do,
However, numpy arrays should be able to override "in" by defining their own __contains__ method, so you could do:
something in an_array
and get a useful, vectorized result.
So I thought I'd see what currently happens when I try that:
In [24]: a
Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [25]: 3 in a
Out[25]: True
So the simple case works just like a list. But what if I want what the OP wants:
In [26]: b
Out[26]: array([3, 6, 4])
In [27]: b in a
Out[27]: False
OK, so the full b array is not in a, and it doesn't "vectorize" it, either. But:
In [29]: a
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]: b in a
Out[30]: True
HUH?
I'm not sure by what definition we would say that b is contained in a.
But maybe...
In [41]: b
Out[41]: array([ 4, 2, 345])
In [42]: b in a
Out[42]: False
So it's "are all of the elements in b in a somewhere?" but only for 2d arrays?
So what does it mean?
The docstring is not helpful:
In [58]: np.ndarray.__contains__?
Type: wrapper_descriptor
Base Class: <type 'wrapper_descriptor'>
String Form: <slot wrapper '__contains__' of 'numpy.ndarray' objects>
Namespace: Interactive
Docstring: x.__contains__(y) <==> y in x
If nothing useful, maybe it could provide a vectorized version of "in" for this sort of use case.
Chris
On Sun, May 29, 2011 at 10:58 PM, Chris Barker Chris.Barker@noaa.gov wrote:
so it's "are all of the elements in b in a somewhere?" but only for 2d arrays?
So what does it mean?
FWIW, a short prelude on the theme seems quite promising, indeed:
In []: A
Out[]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In []: [0, 1, 2] in A
Out[]: True
In []: [0, 3, 6] in A
Out[]: True
In []: [0, 4, 8] in A
Out[]: True
In []: [8, 4, 0] in A
Out[]: True
In []: [2, 4, 6] in A
Out[]: True
In []: [6, 4, 2] in A
Out[]: True
In []: [3, 1, 5] in A
Out[]: True
In [1061]: [3, 1, 4] in A
Out[1061]: True
But
In []: [1, 2, 3] in A
Out[]: False
In []: [3, 2, 1] in A
Out[]: True
So, obviously the logic behind __contains__ is not so very straightforward. Perhaps just a bug?
Regards, eat
Hi folks,
I've retitled this thread, as it's about a new question, now:
What does:
something in a_numpy_array
mean? i.e. how has __contains__ been defined?
A couple of us have played with it, and can't make sense of it:
In [24]: a
Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [25]: 3 in a
Out[25]: True
So the simple case works just like a list. But what if I look for an array in another array?
In [26]: b
Out[26]: array([3, 6, 4])
In [27]: b in a
Out[27]: False
OK, so the full b array is not in a, and it doesn't "vectorize" it, either. But:
In [29]: a
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]: b in a
Out[30]: True
HUH?
I'm not sure by what definition we would say that b is contained in a.
But maybe...
In [41]: b
Out[41]: array([ 4, 2, 345])
In [42]: b in a
Out[42]: False
So it's "are all of the elements in b in a somewhere?" but only for 2d arrays?
So what does it mean?
The docstring is not helpful:
In [58]: np.ndarray.__contains__?
Type: wrapper_descriptor
Base Class: <type 'wrapper_descriptor'>
String Form: <slot wrapper '__contains__' of 'numpy.ndarray' objects>
Namespace: Interactive
Docstring: x.__contains__(y) <==> y in x
On 5/29/11 2:50 PM, eat wrote:
So, obviously the logic behind __contains__ is not so very straightforward. Perhaps just a bug?
Chris
On Tue, May 31, 2011 at 11:25, Christopher Barker Chris.Barker@noaa.gov wrote:
so it's "are all of the elements in b in a somewhere?" but only for 2d arrays?
It dates back to Numeric's semantics for bool(some_array), which would be True if any of the elements were nonzero. Just like any other iterable container in Python, `x in y` will essentially do
for row in y:
    if x == row:
        return True
return False
Iterate along the first axis of y and compare by boolean equality. In Numeric/numpy's case, this comparison is broadcasted. So that's why [3,6,4] works, because there is one row where 3 is in the first column. [4,2,345] doesn't work because the 4 and the 2 are not in those columns.
Probably, this should be considered a mistake during the transition to numpy's semantics of having bool(some_array) raise an exception. `scalar in array` should probably work as-is for an N-D array, but there are several different possible semantics for `array in array` that should be explicitly spelled out, much like bool(some_array).
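The behaviour described above can be reproduced with a broadcasted comparison followed by any(). This is a sketch under the assumption that `x in arr` behaves like (arr == x).any() on the NumPy version in use; the helper name `contains_like` is made up for illustration. Only the 2-D cases from the thread are exercised, since the shapes there broadcast cleanly:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)   # the 4x3 array from the thread


def contains_like(arr, x):
    """Broadcasted equality followed by any(); assumed to mirror `x in arr`."""
    return bool((arr == x).any())


# Each candidate is compared column by column against every row.
candidates = ([3, 6, 4], [4, 2, 345], [1, 2, 3], [3, 2, 1])
results = [(x in a, contains_like(a, x)) for x in candidates]

for x, (builtin_in, emulated) in zip(candidates, results):
    print(x, builtin_in, emulated)
```

[3, 6, 4] matches because 3 appears in the first column, and [3, 2, 1] matches for the same reason; [4, 2, 345] and [1, 2, 3] have no element that lines up with its own column.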
Michael Katz <michaeladamkatz <at> yahoo.com> writes:
Yes, thanks, np.in1d is what I needed. I didn't know how to find that.
Did you check in the documentation? If so, where did you check? Would you have found it if it was in the 'See also' section of where()?
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html)
I ask because people often post to the list needing in1d() after not being able to find it via the docs, so it would be nice to add references in the places people go looking for it.
Neil
Yes, in this case I definitely would have found in1d() if it was referenced in the where() section, either as a "see also" or even better as an example where where() is combined with np.in1d():
indexes_of_interest = np.where( np.in1d( my_records.integer_field, my_values_of_interest ) )
I think the where() documentation page must be a place where a lot of people/newbies spend a lot of time. Perhaps like me they are focusing on the solution being "where() + some python stuff I already know", instead of thinking of other numpy functions, like in1d(), that might come into play.
It makes sense that in1d() is under the "Set" section. However (just to try to explain further why I didn't look and find it there), somehow I think of "set" when I am focused on having a list without duplicates. In my case I wasn't worried about duplicates, just about "I want all the guys that match any of these other guys". I did google for "numpy member", "numpy membership", "numpy in", but none led me to in1d().
Also, it's worth saying that, as a newcomer to numpy and relative newcomer to python, I often think that what I'm looking for isn't going to end up being a function with a name; often some use of slices or (fancy) indexing, or some other "pure syntax" mechanism, ends up doing what you want. So that's one reason I didn't simply scan all the available numpy function names.
________________________________
From: Neil Crighton neilcrighton@gmail.com
To: numpy-discussion@scipy.org
Sent: Sun, May 29, 2011 10:03:25 AM
Subject: Re: [Numpy-discussion] finding elements that match any in a set
participants (8)
Alan G Isaac
Chris Barker
Christopher Barker
eat
Michael Katz
Neil Crighton
Robert Kern
Skipper Seabold