Finding values in an array
I am probably missing something very basic, but given two arrays a and b, how can I find the positions in a where the elements of b are located? If a were sorted, I could use searchsorted, but I don't want to get valid positions for elements that are not in a. In my case, a has unique elements, but in the general case I would accept the first match. In other words, I am looking for an array analog of the list.index() method.
On Thu, Nov 27, 2014 at 10:15 PM, Alexander Belopolsky <ndarray@mac.com> wrote:
I am probably missing something very basic, but given two arrays a and b, how can I find the positions in a where the elements of b are located? If a were sorted, I could use searchsorted, but I don't want to get valid positions for elements that are not in a. In my case, a has unique elements, but in the general case I would accept the first match. In other words, I am looking for an array analog of the list.index() method.
I don't know an easy solution to this problem in pure numpy, but you could do this pretty easily (and quite efficiently) if you are willing to use pandas. Something like:
locs = pd.Index(a).get_indexer(b)
Note that -1 is used to denote a non-match, and get_indexer will raise if the match is non-unique instead of returning the first element. If your array is not 1d, you can still make this work but you'll need to use np.ravel and np.unravel_index.
Actually, you may find that putting your data into pandas data structures is a good solution, since pandas is designed to make exactly this sort of alignment operation easy (and automatic).
I suppose the simplest solution to this problem would be to convert your data into a list and use list.index() repeatedly (or you could even write it yourself in a few lines), but I'd guess that was never implemented for ndarrays because it's rather slow -- better to use a hash table like a dict or pandas.Index for repeated lookups.
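The hash-table idea at the end of this message can be sketched without pandas. The following is only an illustration, not the pandas implementation: the name index_via_dict and the missing parameter are made up here, and the sketch assumes 1-d arrays with hashable elements. Unlike get_indexer, it accepts duplicates in a and returns the first match.

```python
import numpy as np

def index_via_dict(a, b, missing=-1):
    """For each element of b, return the position of its first match in a.

    Hypothetical helper: assumes 1-d inputs with hashable elements.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    first_pos = {}
    for i, value in enumerate(a.tolist()):
        # setdefault keeps the earliest position when a contains duplicates
        first_pos.setdefault(value, i)
    # elements of b that are absent from a get the `missing` marker
    return np.array([first_pos.get(v, missing) for v in b.tolist()],
                    dtype=np.intp)

a = np.array([30, 10, 20, 10])
b = np.array([10, 99, 30])
locs = index_via_dict(a, b)
print(locs)
```

Building the dict costs one pass over a, and every lookup after that is a constant-time hash probe, which is the reason given above for preferring it to repeated list.index() calls.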
On 28.11.2014 04:15, Alexander Belopolsky wrote:
I am probably missing something very basic, but given two arrays a and b, how can I find the positions in a where the elements of b are located? If a were sorted, I could use searchsorted, but I don't want to get valid positions for elements that are not in a. In my case, a has unique elements, but in the general case I would accept the first match. In other words, I am looking for an array analog of the list.index() method.
np.where(np.in1d(a, b))
On Fri, Nov 28, 2014 at 8:22 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
np.where(np.in1d(a, b))
Only if the matching elements in `b` have the same order as they do in `a`.
--
Robert Kern
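A small example may make the ordering point concrete. This is a sketch with made-up data; it uses np.isin, the newer spelling of np.in1d, and a slow list.index() loop purely as a reference for the result that was asked for.

```python
import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([40, 10])  # same values as in a, but in a different order

# The mask is computed over a, so the positions come out in a's order.
positions = np.where(np.isin(a, b))[0]

# What the question asks for follows b's order: first 40, then 10.
wanted = np.array([a.tolist().index(v) for v in b.tolist()])

print(positions)  # positions of the matches, ordered as in a
print(wanted)     # positions of the matches, ordered as in b
```

The two arrays contain the same indices, so the mask approach answers "which elements of a occur in b", not "where in a is each element of b".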
On 28.11.2014 09:37, Robert Kern wrote:
Only if the matching elements in `b` have the same order as they do in `a`.
Seems to work also if unordered:
In [32]: a = np.arange(1000)
In [33]: b = np.arange(500,550, 3)
In [34]: np.random.shuffle(a)
In [35]: np.random.shuffle(b)
In [36]: np.where(np.in1d(a, b))
Out[36]: (array([ 0, 106, 133, 149, 238, 398, 418, 498, 533, 541, 545, 589, 634, 798, 846, 891, 965]),)
On Fri, Nov 28, 2014 at 8:40 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
Seems to work also if unordered:
In [32]: a = np.arange(1000)
In [33]: b = np.arange(500,550, 3)
In [34]: np.random.shuffle(a)
In [35]: np.random.shuffle(b)
In [36]: np.where(np.in1d(a, b))
Out[36]: (array([ 0, 106, 133, 149, 238, 398, 418, 498, 533, 541, 545, 589, 634, 798, 846, 891, 965]),)
I meant that the OP is asking for something stricter than that. He wants this array of indices to be in the order in which those matching elements appear in `b` so that he can use this information to merge two datasets.
--
Robert Kern
On Fri, Nov 28, 2014 at 3:15 AM, Alexander Belopolsky <ndarray@mac.com> wrote:
I am probably missing something very basic, but given two arrays a and b, how can I find the positions in a where the elements of b are located? If a were sorted, I could use searchsorted, but I don't want to get valid positions for elements that are not in a. In my case, a has unique elements, but in the general case I would accept the first match. In other words, I am looking for an array analog of the list.index() method.
How about this?
def index(haystack, needle):
    haystack = np.asarray(haystack)
    haystack_sort = np.argsort(haystack)
    haystack_sorted = haystack[haystack_sort]
    return haystack_sort[np.searchsorted(haystack_sorted, needle)]
(Note that this will return incorrect results if any entries in needle are missing from haystack entirely. If this is a concern then you need to do some extra error-checking on the searchsorted return value.)
-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
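To see both the intended behavior and the caveat in the note, here is the same function run on a few made-up inputs. The data is illustrative only; the function body follows the message above with a shortened variable name.

```python
import numpy as np

def index(haystack, needle):
    haystack = np.asarray(haystack)
    order = np.argsort(haystack)  # positions that would sort haystack
    # search in the sorted copy, then map back to original positions
    return order[np.searchsorted(haystack[order], needle)]

haystack = np.array([30, 10, 20])

# All needles present: results come back in the order of needle.
found = index(haystack, np.array([20, 30, 10]))

# 15 is absent, yet an index is still returned (it points at 20).
wrong = index(haystack, np.array([15]))

# 99 is larger than every entry, so the final indexing step fails.
try:
    index(haystack, np.array([99]))
    raised = False
except IndexError:
    raised = True

print(found, wrong, raised)
```

So a missing needle either silently gets the position of its sorted neighbor or triggers an IndexError, depending on where it would be inserted.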
On Fri, Nov 28, 2014 at 5:15 PM, Nathaniel Smith <njs@pobox.com> wrote:
How about this?
def index(haystack, needle):
    haystack = np.asarray(haystack)
    haystack_sort = np.argsort(haystack)
    haystack_sorted = haystack[haystack_sort]
    return haystack_sort[np.searchsorted(haystack_sorted, needle)]
(Note that this will return incorrect results if any entries in needle are missing from haystack entirely. If this is a concern then you need to do some extra error-checking on the searchsorted return value.)
I like this approach a lot. You can actually skip the creation of the haystack_sorted array using the sorter kwarg:
idx = haystack_sort[np.searchsorted(haystack, needle, sorter=haystack_sort)]
But whether you use haystack_sorted or not, if any item in the needle is larger than the largest entry in the haystack, the indexing will error out with an index out of bounds. So the whole thing with proper error checking gets kind of messy, something along the lines of:
sorted_idx = np.searchsorted(haystack, needle, sorter=haystack_sort)
mask_idx = sorted_idx < len(haystack)
idx = haystack_sort[sorted_idx[mask_idx]]
mask_in_haystack = haystack[idx] == needle[mask_idx]
mask_idx[mask_idx] &= mask_in_haystack
So using -1 to indicate items in needle not found in haystack, you could do:
ret = np.empty_like(needle, dtype=np.intp)
ret[~mask_idx] = -1
ret[mask_idx] = idx[mask_in_haystack]
In the end, it does get kind of messy, but I am not sure how it could be improved. Perhaps giving searchsorted an option to figure out the exact matches?
Jaime
--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him in his plans for world domination.
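The steps in this message can be collected into one function. The following is a sketch rather than a proposed numpy API: the name find_first and the missing parameter are hypothetical, inputs are assumed to be 1-d, and the stable sort with side="left" is an added assumption so that duplicates in haystack resolve to the first occurrence.

```python
import numpy as np

def find_first(haystack, needle, missing=-1):
    """Position in haystack of each element of needle, or `missing`.

    Hypothetical helper: assumes 1-d inputs of a sortable dtype.
    """
    haystack = np.asarray(haystack)
    needle = np.asarray(needle)
    # A stable sort keeps equal values in their original relative order,
    # so the leftmost sorted slot of a value is its first occurrence.
    order = np.argsort(haystack, kind="stable")
    pos = np.searchsorted(haystack, needle, side="left", sorter=order)
    # Needles larger than every entry land one past the end.
    in_range = pos < len(haystack)
    candidates = order[pos[in_range]]
    # An in-range slot is a real match only if the values are equal.
    is_match = haystack[candidates] == needle[in_range]
    result = np.full(needle.shape, missing, dtype=np.intp)
    hit = np.flatnonzero(in_range)[is_match]
    result[hit] = candidates[is_match]
    return result

haystack = np.array([30, 10, 20, 10])
needle = np.array([10, 99, 30, 15])
result = find_first(haystack, needle)
print(result)
```

Here 10 resolves to its first occurrence, 99 falls past the end, and 15 lands on a slot holding a different value, so the last two are reported as missing.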
If we don't have an operation for this in numpy's setops module, it probably should be added.
Ben Root
On Nov 28, 2014 10:21 PM, "Jaime Fernández del Río" <jaime.frio@gmail.com> wrote:
I like this approach a lot. You can actually skip the creation of the haystack_sorted array using the sorter kwarg:
idx = haystack_sort[np.searchsorted(haystack, needle, sorter=haystack_sort)]
But whether you use haystack_sorted or not, if any item in the needle is larger than the largest entry in the haystack, the indexing will error out with an index out of bounds. So the whole thing with proper error checking gets kind of messy, something along the lines of:
sorted_idx = np.searchsorted(haystack, needle, sorter=haystack_sort)
mask_idx = sorted_idx < len(haystack)
idx = haystack_sort[sorted_idx[mask_idx]]
mask_in_haystack = haystack[idx] == needle[mask_idx]
mask_idx[mask_idx] &= mask_in_haystack
So using -1 to indicate items in needle not found in haystack, you could do:
ret = np.empty_like(needle, dtype=np.intp)
ret[~mask_idx] = -1
ret[mask_idx] = idx[mask_in_haystack]
In the end, it does get kind of messy, but I am not sure how it could be improved. Perhaps giving searchsorted an option to figure out the exact matches?
Jaime
participants (7)
- Alexander Belopolsky
- Benjamin Root
- Jaime Fernández del Río
- Julian Taylor
- Nathaniel Smith
- Robert Kern
- Stephan Hoyer