Speeding up isin1d and adding a "method" or similar
Hi all,
there is a PR to add a faster path to `np.isin`, that uses a lookup table for all the elements that are included in the haystack (`test_elements`):
https://github.com/numpy/numpy/pull/12065/files
Such a table means that the memory overhead can be very significant, but the speedup as well, so there was the idea of adding an option to pick which version is used.
The current documentation for this new `method` keyword argument would be. So the main questions are:
* Is there any concern about adding such a new kwarg? * Is `method` the best name? Sorts uses `kind` which may also be good
There is also the smaller question of what heuristic 'auto' would use, but that can be tweaked at any time.
``` method : {'auto', 'sort', 'dictionary'}, optional The algorithm to use. This will not affect the final result, but will affect the speed. Default is 'auto'.
 If 'sort', will use a mergesortbased approach. This will have a memory usage of roughly 6 times the sum of the sizes of `ar1` and `ar2`, not accounting for size of dtypes.  If 'dictionary', will use a keydictionary approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar1` plus the maxmin value of `ar2`. This tends to be the faster method if the following formula is true: `log10(len(ar2)) > (log10(max(ar2)min(ar2))  2.27) / 0.927`, but may use greater memory.  If 'auto', will automatically choose the method which is expected to perform the fastest, using the above formula. For larger sizes or smaller range, 'dictionary' is chosen. For larger range or smaller sizes, 'sort' is chosen.` ```
Cheers,
Sebastian
I think this is a great idea! I don't see any downsides here.
As for the method name, I would lean towards calling it "kind" and using a default value of None for automatic selection, for consistency with np.sort.
On Thu, Jun 16, 2022 at 6:14 AM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi all,
there is a PR to add a faster path to `np.isin`, that uses a lookup table for all the elements that are included in the haystack (`test_elements`):
https://github.com/numpy/numpy/pull/12065/files
Such a table means that the memory overhead can be very significant, but the speedup as well, so there was the idea of adding an option to pick which version is used.
The current documentation for this new `method` keyword argument would be. So the main questions are:
 Is there any concern about adding such a new kwarg?
 Is `method` the best name? Sorts uses `kind` which may also be good
There is also the smaller question of what heuristic 'auto' would use, but that can be tweaked at any time.
method : {'auto', 'sort', 'dictionary'}, optional The algorithm to use. This will not affect the final result, but will affect the speed. Default is 'auto'.  If 'sort', will use a mergesortbased approach. This will have a memory usage of roughly 6 times the sum of the sizes of `ar1` and `ar2`, not accounting for size of dtypes.  If 'dictionary', will use a keydictionary approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar1` plus the maxmin value of `ar2`. This tends to be the faster method if the following formula is true: `log10(len(ar2)) > (log10(max(ar2)min(ar2))  2.27) / 0.927`, but may use greater memory.  If 'auto', will automatically choose the method which is expected to perform the fastest, using the above formula. For larger sizes or smaller range, 'dictionary' is chosen. For larger range or smaller sizes, 'sort' is chosen.`
Cheers,
Sebastian _______________________________________________ NumPyDiscussion mailing list  numpydiscussion@python.org To unsubscribe send an email to numpydiscussionleave@python.org https://mail.python.org/mailman3/lists/numpydiscussion.python.org/ Member address: shoyer@gmail.com
On Fri, 20220617 at 08:40 0700, Stephan Hoyer wrote:
I think this is a great idea! I don't see any downsides here.
As for the method name, I would lean towards calling it "kind" and using a default value of None for automatic selection, for consistency with np.sort.
:+1:, I agree with both points. I am also not sure about the choice of "dictionary", since we don't use a dictionary (or even hashtable)? (Although I may have missed discussion on the name.)
Cheers,
Sebastian
On Thu, Jun 16, 2022 at 6:14 AM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi all,
there is a PR to add a faster path to `np.isin`, that uses a look up table for all the elements that are included in the haystack (`test_elements`):
https://github.com/numpy/numpy/pull/12065/files
Such a table means that the memory overhead can be very significant, but the speedup as well, so there was the idea of adding an option to pick which version is used.
The current documentation for this new `method` keyword argument would be. So the main questions are:
 Is there any concern about adding such a new kwarg?
 Is `method` the best name? Sorts uses `kind` which may also be
good
There is also the smaller question of what heuristic 'auto' would use, but that can be tweaked at any time.
method : {'auto', 'sort', 'dictionary'}, optional The algorithm to use. This will not affect the final result, but will affect the speed. Default is 'auto'.  If 'sort', will use a mergesortbased approach. This will have a memory usage of roughly 6 times the sum of the sizes of `ar1` and `ar2`, not accounting for size of dtypes.  If 'dictionary', will use a keydictionary approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar1` plus the maxmin value of `ar2`. This tends to be the faster method if the following formula is true: `log10(len(ar2)) > (log10(max(ar2)min(ar2))  2.27) / 0.927`, but may use greater memory.  If 'auto', will automatically choose the method which is expected to perform the fastest, using the above formula. For larger sizes or smaller range, 'dictionary' is chosen. For larger range or smaller sizes, 'sort' is chosen.`
Cheers,
Sebastian _______________________________________________ NumPyDiscussion mailing list  numpydiscussion@python.org To unsubscribe send an email to numpydiscussionleave@python.org https://mail.python.org/mailman3/lists/numpydiscussion.python.org/ Member address: shoyer@gmail.com
NumPyDiscussion mailing list  numpydiscussion@python.org To unsubscribe send an email to numpydiscussionleave@python.org https://mail.python.org/mailman3/lists/numpydiscussion.python.org/ Member address: sebastian@sipsolutions.net
Hi all,
just a note that I merged the PR with the following semantics:
A new `kind` keywordonly argument: * `kind=None` uses a memory bound based heuristic to decide which method to use * `kind="table"` uses the new approach (integer arrays only) * `kind="sort"` forces the old behavior
The new documentation is available at: https://numpy.org/devdocs/reference/generated/numpy.in1d.html
It seems this addition should be useful in many cases, but if you have any concern about the choice of API please comment!
Cheers,
Sebastian
On Thu, 20220616 at 06:08 0700, Sebastian Berg wrote:
Hi all,
there is a PR to add a faster path to `np.isin`, that uses a lookup table for all the elements that are included in the haystack (`test_elements`):
https://github.com/numpy/numpy/pull/12065/files
Such a table means that the memory overhead can be very significant, but the speedup as well, so there was the idea of adding an option to pick which version is used.
The current documentation for this new `method` keyword argument would be. So the main questions are:
 Is there any concern about adding such a new kwarg?
 Is `method` the best name? Sorts uses `kind` which may also be
good
There is also the smaller question of what heuristic 'auto' would use, but that can be tweaked at any time.
method : {'auto', 'sort', 'dictionary'}, optional The algorithm to use. This will not affect the final result, but will affect the speed. Default is 'auto'.  If 'sort', will use a mergesortbased approach. This will have a memory usage of roughly 6 times the sum of the sizes of `ar1` and `ar2`, not accounting for size of dtypes.  If 'dictionary', will use a keydictionary approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar1` plus the maxmin value of `ar2`. This tends to be the faster method if the following formula is true: `log10(len(ar2)) > (log10(max(ar2)min(ar2))  2.27) / 0.927`, but may use greater memory.  If 'auto', will automatically choose the method which is expected to perform the fastest, using the above formula. For larger sizes or smaller range, 'dictionary' is chosen. For larger range or smaller sizes, 'sort' is chosen.`
Cheers,
Sebastian _______________________________________________ NumPyDiscussion mailing list  numpydiscussion@python.org To unsubscribe send an email to numpydiscussionleave@python.org https://mail.python.org/mailman3/lists/numpydiscussion.python.org/ Member address: sebastian@sipsolutions.net
participants (2)

Sebastian Berg

Stephan Hoyer