
Finding values in an array

Alexander Belopolsky

28 Nov 2014

3:15 a.m.

I am probably missing something very basic, but given two arrays a and b, how can I find the positions in a where the elements of b are located? If a were sorted, I could use searchsorted, but I don't want to get valid positions for elements that are not in a. In my case, a has unique elements, but in the general case I would accept the first match. In other words, I am looking for an array analog of the list.index() method.


Stephan Hoyer

28 Nov

4:33 a.m.


I don't know of an easy solution to this problem in pure numpy, but you could do this pretty easily (and quite efficiently) if you are willing to use pandas. Something like:

    locs = pd.Index(a).get_indexer(b)

Note that -1 is used to denote a non-match, and get_indexer will raise if the match is non-unique instead of returning the first element.

If your array is not 1d, you can still make this work, but you'll need to use np.ravel and np.unravel_index.

Actually, you may find that putting your data into pandas data structures is a good solution, since pandas is designed to make exactly these sorts of alignment operations easy (and automatic).

I suppose the simplest solution to this problem would be to convert your data into a list and use list.index() repeatedly (or you could even write it yourself in a few lines), but I'd guess that was never implemented for ndarrays because it's rather slow -- better to use a hash table like a dict or pandas.Index for repeated lookups.
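The hash-table suggestion above can be sketched roughly as follows. This is only an illustration, not an existing NumPy or pandas function: the helper name index_via_dict and its missing parameter are made up here, and -1 for a non-match simply mirrors the get_indexer convention mentioned in the message. Unlike get_indexer, this sketch keeps the first match when a contains duplicates.

```python
import numpy as np

def index_via_dict(a, b, missing=-1):
    """Position in `a` of each element of `b`, or `missing` if absent.

    Hypothetical helper for illustration. When `a` contains duplicates,
    the first occurrence wins, as with list.index().
    """
    first_pos = {}
    for i, value in enumerate(np.asarray(a).tolist()):
        # setdefault keeps the earliest position for a repeated value
        first_pos.setdefault(value, i)
    return np.array(
        [first_pos.get(value, missing) for value in np.asarray(b).tolist()],
        dtype=np.intp,
    )

a = np.array([10, 30, 20, 30])
b = np.array([30, 99, 10])
locs = index_via_dict(a, b)
print(locs)  # 30 is first seen at position 1, 99 is absent, 10 is at position 0
```

Building the dict costs one pass over a, after which each lookup is a constant-time hash probe, which is the reason a dict is preferred over repeated list.index() calls.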

Julian Taylor

8:22 a.m.


np.where(np.in1d(a, b))

Robert Kern

8:37 a.m.


Only if the matching elements in `b` have the same order as they do in `a`.

-- Robert Kern

Julian Taylor

8:40 a.m.


It seems to work also if unordered:

    In [32]: a = np.arange(1000)
    In [33]: b = np.arange(500, 550, 3)
    In [34]: np.random.shuffle(a)
    In [35]: np.random.shuffle(b)
    In [36]: np.where(np.in1d(a, b))
    Out[36]: (array([ 0, 106, 133, 149, 238, 398, 418, 498, 533, 541, 545, 589, 634, 798, 846, 891, 965]),)

Robert Kern

9:01 a.m.


I meant that the OP is asking for something stricter than that. He wants this array of indices to be in the order in which those matching elements appear in `b` so that he can use this information to merge two datasets.

-- Robert Kern
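The distinction drawn here can be seen on a tiny example. The sketch below uses np.isin, which for 1-D inputs gives the same mask as the np.in1d call discussed in the thread; the array values and variable names are made up for illustration, and the list.index() loop is only a naive reference for what the original question asks for.

```python
import numpy as np

a = np.array([40, 10, 30, 20])
b = np.array([20, 40])

# Membership mask over `a`, then the positions where it is True.
# These come out in ascending order, i.e. in the order of `a`.
positions_in_a_order = np.where(np.isin(a, b))[0]

# What the question asks for: one position per element of `b`,
# aligned with `b` (built naively here with list.index()).
positions_in_b_order = np.array([a.tolist().index(v) for v in b.tolist()])

print(positions_in_a_order)  # [0 3]
print(positions_in_b_order)  # [3 0]
```

Only the second result satisfies a[positions] == b element by element, which is the property needed to line up two datasets.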

Nathaniel Smith

29 Nov

1:15 a.m.

On Fri, Nov 28, 2014 at 3:15 AM, Alexander Belopolsky <ndarray@mac.com> wrote:

...

I probably miss something very basic, but how given two arrays a and b, can I find positions in a where elements of b are located? If a were sorted, I could use searchsorted, but I don't want to get valid positions for elements that are not in a. In my case, a has unique elements, but in the general case I would accept the first match. In other words, I am looking for an array analog of list.index() method.

How about this?

    def index(haystack, needle):
        haystack = np.asarray(haystack)
        # permutation that sorts haystack
        haystack_sort = np.argsort(haystack)
        haystack_sorted = haystack[haystack_sort]
        # searchsorted returns positions in the sorted copy; map them back
        return haystack_sort[np.searchsorted(haystack_sorted, needle)]

(Note that this will return incorrect results if any entries in needle are missing from haystack entirely. If this is a concern then you need to do some extra error-checking on the searchsorted return value.)

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
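To make the caveat about missing entries concrete, here is a small run of the function from this message. The function body is copied from the message; the sample arrays and the names found, wrong and raised are illustrative assumptions.

```python
import numpy as np

def index(haystack, needle):
    haystack = np.asarray(haystack)
    haystack_sort = np.argsort(haystack)
    haystack_sorted = haystack[haystack_sort]
    return haystack_sort[np.searchsorted(haystack_sorted, needle)]

haystack = np.array([40, 10, 30, 20])

# Every needle is present: positions come back in the order of the needles.
found = index(haystack, np.array([20, 40, 10]))
print(found, haystack[found])

# 25 is absent but smaller than the maximum: a position is still returned,
# and it points at a different value (the next larger one).
wrong = index(haystack, np.array([25]))
print(wrong, haystack[wrong])

# 50 is absent and larger than the maximum: searchsorted returns
# len(haystack), so the final indexing step raises IndexError.
try:
    index(haystack, np.array([50]))
    raised = False
except IndexError:
    raised = True
print(raised)
```

So a missing needle shows up in two different ways, a silently wrong position or an exception, and both need to be handled if needle is not guaranteed to be a subset of haystack.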

Jaime Fernández del Río

3:21 a.m.


I like this approach a lot. You can actually skip the creation of the haystack_sorted array by using the sorter kwarg:

    idx = haystack_sort[np.searchsorted(haystack, needle, sorter=haystack_sort)]

But whether or not you use haystack_sorted, if any item in needle is larger than the largest entry in haystack, the indexing will fail with an index-out-of-bounds error. So the whole thing with proper error checking gets kind of messy, something along the lines of:

    sorted_idx = np.searchsorted(haystack, needle, sorter=haystack_sort)
    mask_idx = sorted_idx < len(haystack)
    idx = haystack_sort[sorted_idx[mask_idx]]
    mask_in_haystack = haystack[idx] == needle[mask_idx]
    mask_idx[mask_idx] &= mask_in_haystack

So, using -1 to indicate items in needle not found in haystack, you could do:

    ret = np.empty_like(needle, dtype=np.intp)
    ret[~mask_idx] = -1
    ret[mask_idx] = idx[mask_in_haystack]

In the end, it does get kind of messy, but I am not sure how it could be improved. Perhaps by giving searchsorted an option to report only the exact matches?

Jaime

--
(\__/)
( O.o)
( > <) This is Conejo (Bunny). Copy Conejo into your signature and help him with his plans for world domination.
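Put together, the fragments in this message might look like the following function. This is a sketch under a few assumptions: the name index_or_minus_one is invented here, a stable sort is requested so that duplicates in haystack resolve to their first occurrence, and the in-place mask update is written with an explicit copy instead of the self-indexed &= from the message.

```python
import numpy as np

def index_or_minus_one(haystack, needle):
    """Position in `haystack` of each entry of `needle`, or -1 if absent.

    Hypothetical helper assembled from the snippets in this thread.
    """
    haystack = np.asarray(haystack)
    needle = np.asarray(needle)

    # Stable sort: among equal values the earliest position sorts first,
    # so a left-sided searchsorted lands on the first occurrence.
    haystack_sort = np.argsort(haystack, kind="stable")
    sorted_idx = np.searchsorted(haystack, needle, sorter=haystack_sort)

    # Needles larger than every haystack entry map to len(haystack).
    in_bounds = sorted_idx < len(haystack)
    idx = haystack_sort[sorted_idx[in_bounds]]

    # Of the in-bounds candidates, keep only exact matches.
    exact = haystack[idx] == needle[in_bounds]
    found = in_bounds.copy()
    found[in_bounds] = exact

    ret = np.full(needle.shape, -1, dtype=np.intp)
    ret[found] = idx[exact]
    return ret

haystack = np.array([40, 10, 30, 20, 10])
needle = np.array([20, 99, 10, 25, 40])
result = index_or_minus_one(haystack, needle)
print(result)  # [ 3 -1  1 -1  0]
```

The cost is dominated by the argsort of haystack, so for many repeated lookups against the same haystack it may be worth computing haystack_sort once and reusing it.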

Benjamin Root

3:26 a.m.

If we don't have an operation for this in numpy's setops module, it probably should be added.

Ben Root
