API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688
Dear Community,

For my own work, I required the intersect1d function to work on multiple arrays while returning the indices (using `return_indices=True`). Consequently I changed the function in numpy and I am now seeking feedback from the community. This is the corresponding PR: https://github.com/numpy/numpy/pull/25688

My motivation for the change may also apply to a larger group of people, as it is important for a lot of simulation data analysis: in various simulations it is often the case that many entities (particles, cells, vehicles, whatever the simulation consists of) are tracked throughout the simulation. A typical approach is to assign a unique ID to every entity which stays constant and unique throughout the simulation and is written, together with other properties of the entities, to every simulation snapshot in time. Note that during the simulation new entities may enter or leave, and due to parallelization the order of the entities is not conserved. Tracking the position of entities over, let's say, 100 snapshots requires the intersection of 100 ID lists instead of only two.

Consequently I changed the intersect1d signature from `intersect1d(ar1, ar2, assume_unique=False, return_indices=False)` to `intersect1d(*ars, assume_unique=False, return_indices=False)`.

Please let me know if there is any interest in these changes -- be it in this form or another.

All the Best
Stephan
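A quick sketch of the intended usage with the patched function (illustration only; the return layout shown here -- the common values followed by one index array per input array -- is meant as a sketch, see the PR for the actual behavior):

import numpy as np

# IDs of the same entities recorded in three snapshots, in arbitrary order
ids_a = np.array([7, 3, 9, 1, 5])
ids_b = np.array([5, 9, 2, 7])
ids_c = np.array([9, 8, 5, 7, 0])

# proposed generalized call: intersect all snapshots at once
common, ind_a, ind_b, ind_c = np.intersect1d(ids_a, ids_b, ids_c, return_indices=True)

# common       -> array([5, 7, 9])
# ids_a[ind_a] -> array([5, 7, 9]), and likewise for ind_b and ind_c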
To me this looks like a very sensible generalization. In terms of the numpy API, the only real change is that, effectively, the assume_unique and return_indices arguments become keyword-only; i.e., in the unlikely case that someone passed those positionally, a trivial backward-compatible change will fix it (example below).

-- Marten
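For instance, a positional call that works today,

np.intersect1d(a, b, False, True)

would simply need to be spelled with keywords,

np.intersect1d(a, b, assume_unique=False, return_indices=True)

since with `*ars` every positional argument is taken to be another input array.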
Just curious, how much faster is it compared to the currently recommended `reduce` approach?

DG
Dear Dom,

I just checked, and on my computer the new version is roughly a factor of 2 faster than the reduce approach if the arrays are shuffled. For sorted arrays, the new version is a factor of 3.4 faster:

import numpy as np
from functools import reduce

idss = [np.random.permutation(np.arange(a*100, int(1e5)+a*100, 1)) for a in range(20)]
%timeit intersect1d(*idss)            # 166 ms +- 47 ms
%timeit reduce(np.intersect1d, idss)  # 301 ms +- 3.7 ms

and

idss = [np.arange(a*100, int(1e5)+a*100, 1) for a in range(20)]
%timeit intersect1d(*idss)            # 77 ms +- 6 ms
%timeit reduce(np.intersect1d, idss)  # 212 ms +- 3.8 ms

Stephan
Also, I don’t know if this could be of value, but my use case for this is to find overlaps, then split arrays into overlapping and non-overlapping segments. Thus, it might be useful for `return_indices=True` to return the indices of all instances, not only the first.

Also, in my case I need both overlapping and non-overlapping indices, but this would become ambiguous with more than 2 arrays. If it was left with 2-array input, then it could be extended to return both overlapping and non-overlapping parts (a sketch for the two-array case follows below). I think it could be another potential path to consider.

E.g. what would be the speed comparison:

intr = intersect1d(arr1, arr2, assume_unique=False)
intr = intersect1d(intr, np.unique(arr3), assume_unique=True)

# VS new

intr = intersect1d(arr1, arr2, arr3, assume_unique=False)

Then, does the gain from such a generalisation justify the constriction it introduces?

Regards,
DG
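Roughly what I mean by the two-array split (just a sketch, assuming unique values):

import numpy as np

arr1 = np.array([4, 1, 7, 2, 9])
arr2 = np.array([2, 8, 4, 6])

# overlapping part and its indices in each original array
common, ind1, ind2 = np.intersect1d(arr1, arr2, return_indices=True)

overlap1     = arr1[ind1]             # elements of arr1 that also occur in arr2
non_overlap1 = np.delete(arr1, ind1)  # elements of arr1 that do not occur in arr2
# analogously for arr2 using ind2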
Dear Dom,

thanks for bringing up the possible constriction. I agree that this would be a serious argument against the change.

However, as you said, the overlapping/non-overlapping indices would become ambiguous with more than two arrays, and calling the function with only two arrays at a time would still be possible. So we would only be unable to generalize in the future towards a problem that has ambiguous solutions anyway, and I fail to see what exactly the other use case would be.

The point of this change is not the luxury of allowing multiple arrays to calculate the intersection. It's all about getting the indices in the original arrays, using `return_indices=True` (a small sketch follows below).

All the Best
Stephan
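To make that concrete: with the current two-array signature, the indices returned by chained calls refer to the intermediate intersection and have to be composed by hand to point back into the original arrays. Roughly (a sketch, not the code from the PR):

import numpy as np

ids1 = np.array([7, 3, 9, 1, 5])
ids2 = np.array([5, 9, 2, 7])
ids3 = np.array([9, 8, 5, 7, 0])

# chained two-array calls
common12, i1, i2 = np.intersect1d(ids1, ids2, return_indices=True)
common, j12, i3 = np.intersect1d(common12, ids3, return_indices=True)

# indices into the original arrays must be composed manually
ind1 = i1[j12]
ind2 = i2[j12]
# now ids1[ind1] == ids2[ind2] == ids3[i3] == common

With 100 snapshots this bookkeeping has to be repeated at every step, which is exactly what the generalized call is meant to take off the user's hands.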
Hi Stephan,

Thanks for replying to my concerns. I have looked a bit more into it and `intersect1d` is not the right concept for my problem anyway. And it is great that it works faster than the reduce approach.

Also, I did more speed tests and, after all, the `in1d` approach is not generally faster. It is faster in some cases and slower in others. Also, it uses more memory in all cases.

Regards,
DG
On Fri, Feb 2, 2024 at 6:34 AM Stephan Kuschel via NumPy-Discussion <numpy-discussion@python.org> wrote:
<snip>
Seems reasonable. I don't know if it is faster, but NumPy is all about vectorization.

Chuck
Surely speed needs to fit into the equation somehow?
Another point to consider: is the function `in1d` able to achieve everything that `intersect1d` does? For two arrays of length 1600:

def intersect1d_via_in1d(x, y):
    return np.unique(x[np.in1d(x, y)])

%timeit np.intersect1d(r1, r2)        # 927 µs ± 10.2 µs
%timeit intersect1d_via_in1d(r1, r2)  # 163 µs ± 1.73 µs

Retrieving the indices and other outputs that `intersect1d` provides also seems trivial (a sketch follows below). If it is faster and doesn’t consume more memory, then maybe it is worthwhile re-using it in `intersect1d`?

Regards,
DG
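One way the indices could be recovered with the `in1d` approach (a sketch only, assuming unique values in both arrays):

import numpy as np

def intersect1d_via_in1d_indices(x, y):
    # assumes x and y each contain unique values
    mask = np.in1d(x, y)
    common = x[mask]              # note: in the order of x, not sorted like np.intersect1d
    ind_x = np.flatnonzero(mask)  # indices of the common values in x
    order = np.argsort(y)
    ind_y = order[np.searchsorted(y, common, sorter=order)]  # indices of the same values in y
    return common, ind_x, ind_y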
participants (4)
- Charles R Harris
- Dom Grigonis
- Marten van Kerkwijk
- Stephan Kuschel