finding elements that match any in a set
I have a numpy array, records, with named fields including a field named "integer_field". I have an array (or list) of values of interest, and I want to get the indexes where integer_field has any of those values.
Because I can do
indexes = np.where( records.integer_field > 5 )
I thought I could do
indexes = np.where( records.integer_field in values )
But that doesn't work. (As a side question I'm interested in why that doesn't work, when values is a python list.)
How can I get those indexes? I thought perhaps I need a nested np.where, or some other two-step process, but it wasn't clear to me how to do it.
Thanks.
On Fri, May 27, 2011 at 12:48 PM, Michael Katz <michaeladamkatz@yahoo.com> wrote:
I have a numpy array, records, with named fields including a field named "integer_field". I have an array (or list) of values of interest, and I want to get the indexes where integer_field has any of those values.
Because I can do
indexes = np.where( records.integer_field > 5 )
I thought I could do
indexes = np.where( records.integer_field in values )
But that doesn't work. (As a side question I'm interested in why that doesn't work, when values is a python list.)
How can I get those indexes? I thought perhaps I need a nested np.where, or some other two-step process, but it wasn't clear to me how to do it.
Check out this recent thread. I think the proposed class does what you want. It's more efficient than in1d, if values is small compared to the length of records.
http://thread.gmane.org/gmane.comp.python.scientific.user/29035/
Skipper
On 5/27/11 9:48 AM, Michael Katz wrote:
I have a numpy array, records, with named fields including a field named "integer_field". I have an array (or list) of values of interest, and I want to get the indexes where integer_field has any of those values.
Because I can do
indexes = np.where( records.integer_field > 5 )
I thought I could do
indexes = np.where( records.integer_field in values )
But that doesn't work. (As a side question I'm interested in why that doesn't work, when values is a python list.)
That doesn't work because the Python list "in" operator doesn't understand arrays, so it is looking to see if the entire array is in the list. Actually, it doesn't even get that far:
In [16]: a in l
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/chris.barker/<ipython console> in <module>()
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The ValueError results because it was decided that a numpy array should not have a boolean value, to avoid confusion: is an array true whenever it is non-empty (like a list), or when all its elements are true, or something else?
When I read this question, I thought: hmmm, numpy needs something like "in", as the usual way, np.any(), would require a loop in this case. Then I read Skipper's message:
On 5/27/11 9:55 AM, Skipper Seabold wrote:
Check out this recent thread. I think the proposed class does what you want. It's more efficient than in1d, if values is small compared to the length of records.
So that class may be worthwhile, but I think np.in1d is exactly what you are looking for:
indexes = np.in1d( records.integer_field, values )
Funny I'd never noticed that before.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
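A small sketch of the suggestion above, for readers who want to try it. The `records` array, its second field, and the values below are made up for illustration. Note that np.in1d returns a boolean mask rather than integer indexes, so it still needs np.where or np.flatnonzero to get positions. The sketch uses np.isin, the newer spelling of the same test (available since NumPy 1.13); as far as I know, recent NumPy releases deprecate np.in1d in its favor.

```python
import numpy as np

# Hypothetical structured array standing in for the poster's `records`.
records = np.array(
    [(10, 1.5), (3, 2.5), (7, 0.5), (3, 4.0), (8, 9.0)],
    dtype=[("integer_field", "i4"), ("value", "f8")],
)
values = [3, 8]  # values of interest; a list or an array both work

# Boolean mask: True where integer_field equals any entry of `values`.
mask = np.isin(records["integer_field"], values)

# Positions of the matches, as a 1-D integer array.
indexes = np.flatnonzero(mask)

print(mask)     # [False  True False  True  True]
print(indexes)  # [1 3 4]

# Either the mask or the indexes can be used to pull out the matching rows.
matches = records[mask]
```

Indexing with the mask directly is usually enough; the integer indexes are only needed when the positions themselves matter.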
Yes, thanks, np.in1d is what I needed. I didn't know how to find that.
It still seems counterintuitive to me that
indexes = np.where( records.integer_field in values )
does not work whereas
indexes = np.where( records.integer_field > 5 )
does.
In one case numpy is overriding the > operator; it's not checking if an array is greater than 5, but whether each element in the array is greater than 5.
From a naive user's point of view, not knowing much about the difference between > and in from a python point of view, it seems like in would get overridden the same way.
On Sat, May 28, 2011 at 14:18, Michael Katz <michaeladamkatz@yahoo.com> wrote:
Yes, thanks, np.in1d is what I needed. I didn't know how to find that.
It still seems counterintuitive to me that
indexes = np.where( records.integer_field in values )
does not work whereas
indexes = np.where( records.integer_field > 5 ) does.
In one case numpy is overriding the > operator; it's not checking if an array is greater than 5, but whether each element in the array is greater than 5.
From a naive user's point of view, not knowing much about the difference between > and in from a python point of view, it seems like in would get overridden the same way.
The Python operators are turned into special method calls on one of the objects. Most of the special methods that define the mathematical operators come in pairs: __lt__ and __gt__, __add__ and __radd__, etc. So if we have (x > y) then x.__gt__(y) is checked first. If x does not know about the type of y, then y.__lt__(x) is checked. Similarly, for (x + y), x.__add__(y) is checked first, then y.__radd__(x) is checked.
For (myarray > 5), myarray.__gt__(5) is checked. numpy arrays do know about ints, so that works. However, (myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist), so there is nothing that numpy can override to make this operation do anything different from what lists normally do, which is check if the given object is equal to one of the items in the list.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
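The dispatch rules described above can be seen without numpy at all. Below is a minimal sketch in plain Python; `Elementwise` and `Ambiguous` are hypothetical toy classes invented for this illustration, not anything from numpy. The left operand controls `>`, but `in` is handled by the list, which compares with `==` and then needs a single True/False.

```python
class Ambiguous:
    """Elementwise comparison result that refuses to collapse to one bool."""

    def __init__(self, flags):
        self.flags = flags

    def __bool__(self):
        # `in` needs a single True/False for each comparison it makes.
        raise ValueError("truth value of an elementwise result is ambiguous")


class Elementwise:
    """Toy stand-in for an array; hypothetical, not a NumPy class."""

    def __init__(self, items):
        self.items = list(items)

    def __gt__(self, other):
        # `self > other` dispatches here, so the class is free to
        # return one result per element instead of a single bool.
        return [item > other for item in self.items]

    def __eq__(self, other):
        # `==` can likewise return an elementwise result.
        return Ambiguous([item == other for item in self.items])


data = Elementwise([2, 6, 9])

print(data > 5)  # [False, True, True]

try:
    # Handled by list.__contains__, which compares with == and then
    # asks for the truth value of each comparison result.
    data in [2, 6]
    outcome = "no error"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(outcome)
```

The class gets a say in what `==` returns, but not in what the list does with that result, which is where the error comes from.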
Thanks for the explanation. It sounds like Python needs __rcontains__.
On Sat, May 28, 2011 at 14:40, Michael Katz <michaeladamkatz@yahoo.com> wrote:
Thanks for the explanation. It sounds like Python needs __rcontains__.
Not really. For the mathematical operators, there are good ways for the operands to "know" what types they can deal with and which they can't. For mylist.__contains__(x), it should treat all objects exactly the same: check if it equals any item that it contains. There is no way for it to say, "Oh, I don't know how to deal with this type, so I'll pass it over to x.__contains__()". A function call is the best place for this operation, not syntax.
--
Robert Kern
On 5/28/2011 3:46 PM, Robert Kern wrote:
mylist.__contains__(x), it should treat all objects exactly the same: check if it equals any item that it contains. There is no way for it to say, "Oh, I don't know how to deal with this type, so I'll pass it over to x.__contains__()".
Which makes my comment redundant ...
On 5/28/2011 3:40 PM, Robert wrote:
(myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist) so there is nothing that numpy can override to make this operation do anything different from what lists normally do, which is check if the given object is equal to one of the items in the list.
This seems to me to slightly miscast the problem. How would an __rcontains__ method really fix things? Would the list type check against a table of stuff that it knows how to contain? That seems horrible. And even if possible, NumPy would then have to break the rule that ``in`` tests for equality, because (I believe) the real problem in this case is that np equality testing does not return a bool. From this perspective, what is missing is not __rcontains__ (since the list already knows what to do) but rather __eeq__ for element-by-element comparison (ideally, along with an element-by-element operator such as say .==).
In the meantime the OP could use
any(all(a==x) for x in lst)
fwiw,
Alan Isaac
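For reference, here is that last expression run on made-up data. Note that it answers "is the whole array `a` equal to some item of `lst`?", which is a different question from the per-element membership asked about at the start of the thread. The `np.array_equal` variant is an addition for illustration, not part of the suggestion above.

```python
import numpy as np

a = np.array([3, 6, 4])
lst = [np.array([0, 1, 2]), np.array([3, 6, 4])]

# The expression from the message: is the whole array `a` equal to
# some item of `lst`?
found = any(all(a == x) for x in lst)

# A variant that also tolerates items of a different shape, because
# np.array_equal returns False instead of trying to broadcast.
found_safe = any(np.array_equal(a, x) for x in lst + [np.array([1, 2])])

print(found, found_safe)  # True True
```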
On 5/28/2011 3:40 PM, Robert wrote:
(myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist) so there is nothing that numpy can override to make this operation do anything different from what lists normally do,
However, numpy arrays should be able to override "in" by defining their own __contains__ method, so you could do:
something in an_array
and get a useful, vectorized result.
So I thought I'd see what currently happens when I try that:
In [24]: a
Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [25]: 3 in a
Out[25]: True
So the simple case works just like a list. But what if I want what the OP wants:
In [26]: b
Out[26]: array([3, 6, 4])
In [27]: b in a
Out[27]: False
OK, so the full b array is not in a, and it doesn't "vectorize" it, either. But:
In [29]: a
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]: b in a
Out[30]: True
HUH?
I'm not sure by what definition we would say that b is contained in a.
But maybe...
In [41]: b
Out[41]: array([ 4, 2, 345])
In [42]: b in a
Out[42]: False
So it's "are all of the elements in b in a somewhere?" but only for 2-d arrays?
So what does it mean?
The docstring is not helpful:
In [58]: np.ndarray.__contains__?
Type: wrapper_descriptor
Base Class: <type 'wrapper_descriptor'>
String Form: <slot wrapper '__contains__' of 'numpy.ndarray' objects>
Namespace: Interactive
Docstring:
    x.__contains__(y) <==> y in x
If nothing useful, maybe it could provide a vectorized version of "in" for this sort of use case.
-Chris
On Sun, May 29, 2011 at 10:58 PM, Chris Barker <Chris.Barker@noaa.gov> wrote:
On 5/28/2011 3:40 PM, Robert wrote:
(myarray in mylist) turns into mylist.__contains__(myarray). Only the list object is ever checked for this method. There is no paired method myarray.__rcontains__(mylist) so there is nothing that numpy can override to make this operation do anything different from what lists normally do,
however, numpy arrays should be able to override "in" by defining their own __contains__ method, so you could do:
something in an_array
and get a useful, vectorized result.
So I thought I'd see what currently happens when I try that:
In [24]: a
Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [25]: 3 in a
Out[25]: True
So the simple case works just like a list. But what if I want what the OP wants:
In [26]: b
Out[26]: array([3, 6, 4])
In [27]: b in a
Out[27]: False
OK, so the full b array is not in a, and it doesn't "vectorize" it, either. But:
In [29]: a
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]: b in a
Out[30]: True
HUH?
I'm not sure by what definition we would say that b is contained in a.
but maybe..
In [41]: b
Out[41]: array([ 4, 2, 345])
In [42]: b in a
Out[42]: False
so it's "are all of the elements in b in a somewhere?" but only for 2-d arrays?
So what does it mean?
FWIW, a short prelude on the theme seems quite promising, indeed:
In []: A
Out[]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In []: [0, 1, 2] in A
Out[]: True
In []: [0, 3, 6] in A
Out[]: True
In []: [0, 4, 8] in A
Out[]: True
In []: [8, 4, 0] in A
Out[]: True
In []: [2, 4, 6] in A
Out[]: True
In []: [6, 4, 2] in A
Out[]: True
In []: [3, 1, 5] in A
Out[]: True
In [1061]: [3, 1, 4] in A
Out[1061]: True
But
In []: [1, 2, 3] in A
Out[]: False
In []: [3, 2, 1] in A
Out[]: True
So, obviously the logic behind __contains__ is not so very straightforward. Perhaps just a bug?
Regards,
eat
The docstring is not helpful:
In [58]: np.ndarray.__contains__?
Type: wrapper_descriptor
Base Class: <type 'wrapper_descriptor'>
String Form: <slot wrapper '__contains__' of 'numpy.ndarray' objects>
Namespace: Interactive
Docstring:
    x.__contains__(y) <==> y in x
If nothing useful, maybe it could provide a vectorized version of "in" for this sort of use case.
-Chris
Hi folks, I've re-titled this thread, as it's about a new question, now: What does: something in a_numpy_array mean? i.e. how has __contains__ been defined? A couple of us have played with it, and can't make sense of it:
In [24]: a
Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [25]: 3 in a
Out[25]: True
So the simple case works just like a list. But what if I look for an array in another array?
In [26]: b
Out[26]: array([3, 6, 4])
In [27]: b in a
Out[27]: False
OK, so the full b array is not in a, and it doesn't "vectorize" it, either. But:
In [29]: a
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]: b in a
Out[30]: True
HUH?
I'm not sure by what definition we would say that b is contained in a.
but maybe..
In [41]: b
Out[41]: array([ 4, 2, 345])
In [42]: b in a
Out[42]: False
so it's "are all of the elements in b in a somewhere?" but only for 2-d arrays?
So what does it mean?
The docstring is not helpful:
In [58]: np.ndarray.__contains__?
Type: wrapper_descriptor
Base Class: <type 'wrapper_descriptor'>
String Form: <slot wrapper '__contains__' of 'numpy.ndarray' objects>
Namespace: Interactive
Docstring:
    x.__contains__(y) <==> y in x
On 5/29/11 2:50 PM, eat wrote:
FWIW, a short prelude on the theme seems quite promising, indeed:
In []: A
Out[]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In []: [0, 1, 2] in A
Out[]: True
In []: [0, 3, 6] in A
Out[]: True
In []: [0, 4, 8] in A
Out[]: True
In []: [8, 4, 0] in A
Out[]: True
In []: [2, 4, 6] in A
Out[]: True
In []: [6, 4, 2] in A
Out[]: True
In []: [3, 1, 5] in A
Out[]: True
In [1061]: [3, 1, 4] in A
Out[1061]: True
But
In []: [1, 2, 3] in A
Out[]: False
In []: [3, 2, 1] in A
Out[]: True
So, obviously the logic behind __contains__ is not so very straightforward. Perhaps just a bug?
-Chris
On Tue, May 31, 2011 at 11:25, Christopher Barker <Chris.Barker@noaa.gov> wrote:
Hi folks,
I've re-titled this thread, as it's about a new question, now:
What does:
something in a_numpy_array
mean? i.e. how has __contains__ been defined?
A couple of us have played with it, and can't make sense of it:
In [24]: a
Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [25]: 3 in a
Out[25]: True
So the simple case works just like a list. But what if I look for an array in another array?
In [26]: b
Out[26]: array([3, 6, 4])
In [27]: b in a
Out[27]: False
OK, so the full b array is not in a, and it doesn't "vectorize" it, either. But:
In [29]: a
Out[29]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [30]: b in a
Out[30]: True
HUH?
I'm not sure by what definition we would say that b is contained in a.
but maybe..
In [41]: b
Out[41]: array([ 4, 2, 345])
In [42]: b in a
Out[42]: False
so it's "are all of the elements in b in a somewhere?" but only for 2-d arrays?
It dates back to Numeric's semantics for bool(some_array), which would be True if any of the elements were nonzero. Just like any other iterable container in Python, `x in y` will essentially do
for row in y:
    if x == row:
        return True
return False
Iterate along the first axis of y and compare by boolean equality. In Numeric/numpy's case, this comparison is broadcasted. So that's why [3,6,4] works, because there is one row where 3 is in the first column. [4,2,345] doesn't work because the 4 and the 2 are not in those columns.
Probably, this should be considered a mistake during the transition to numpy's semantics of having bool(some_array) raise an exception. `scalar in array` should probably work as-is for an ND array, but there are several different possible semantics for `array in array` that should be explicitly spelled out, much like bool(some_array).
--
Robert Kern
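The behaviour described above can be reproduced with an explicit broadcast comparison. This is a sketch reusing the arrays from earlier in the thread; `c`, `is_a_row`, and `each_present` are names chosen here, and the last two are just two of the several possible explicit meanings of `array in array`, not a proposal from the thread.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)
b = np.array([3, 6, 4])
c = np.array([4, 2, 345])

# What `x in a` appears to compute: broadcast ==, then any().
print((a == b).any(), b in a)  # True True
print((a == c).any(), c in a)  # False False

# Two of the possible explicit meanings of "array in array":
is_a_row = (a == b).all(axis=1).any()  # is b equal to a whole row of a?
each_present = np.isin(b, a)           # is each element of b somewhere in a?

print(is_a_row)      # False
print(each_present)  # [ True  True  True]
```

Spelling out the intended meaning with a function call like these avoids relying on the column-wise match that `in` happens to perform.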
Michael Katz <michaeladamkatz <at> yahoo.com> writes:
Yes, thanks, np.in1d is what I needed. I didn't know how to find that.
Did you check in the documentation? If so, where did you check? Would you have found it if it was in the 'See also' section of where()?
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html)
I ask because people often post to the list needing in1d() after not being able to find it via the docs, so it would be nice to add references in the places people go looking for it.
Neil
Yes, in this case I definitely would have found in1d() if it was referenced in the where() section, either as a "see also" or even better as an example where where() is combined with np.in1d():
indexes_of_interest = np.where( np.in1d( my_records.integer_field, my_values_of_interest ) )
I think the where() documentation page must be a place where a lot of people/newbies spend a lot of time. Perhaps like me they are focusing on the solution being "where() + some python stuff I already know", instead of thinking of other numpy functions, like in1d(), that might come into play.
It makes sense that in1d() is under the "Set" section. However (just to try to explain further why I didn't look and find it there), somehow I think of "set" when I am focused on having a list without duplicates. In my case I wasn't worried about duplicates, just about "I want all the guys that match any of these other guys". I did google for "numpy member", "numpy membership", "numpy in", but none led me to in1d().
Also, it's worth saying that, as a newcomer to numpy and relative newcomer to python, I often think that what I'm looking for isn't going to end up being a function with a name -- often some use of slices or (fancy) indexing, or some other "pure syntax" mechanism, ends up doing what you want. So that's one reason I didn't simply scan all the available numpy function names.
participants (8)
- Alan G Isaac
- Chris Barker
- Christopher Barker
- eat
- Michael Katz
- Neil Crighton
- Robert Kern
- Skipper Seabold