Hi all,
Finding the top-k elements is a widely used operation in several fields, but it is missing from NumPy. I have implemented this functionality as numpy.topk using core NumPy functions and opened a PR:
https://github.com/numpy/numpy/pull/19117
Any discussion is welcome.
Best wishes,
Kang Kai
On Fri, May 28, 2021 at 4:58 PM kangkai@mail.ustc.edu.cn wrote:
Hi all,
Finding the top-k elements is a widely used operation in several fields, but it is missing from NumPy. I have implemented this functionality as numpy.topk using core NumPy functions and opened a PR:
https://github.com/numpy/numpy/pull/19117
Any discussion is welcome.
Thanks for the proposal Kang. I think this functionality is indeed a fairly obvious gap in what Numpy offers, and would make sense to add. A detailed comparison with other libraries would be very helpful here. TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and MXNet call it `topk`.
Two things to look at in more detail here are:
1. complete signatures of the function in each of those libraries, and what the commonality is there.
2. the argument Eric made on your PR about consistency with sort/argsort, and if we want topk/argtopk? Also, do other libraries have `argtopk`?
Cheers, Ralf
Best wishes,
Kang Kai _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sat, 29 May 2021, 4:29 pm Ralf Gommers, ralf.gommers@gmail.com wrote:
Thanks for the proposal Kang. I think this functionality is indeed a fairly obvious gap in what Numpy offers, and would make sense to add. A detailed comparison with other libraries would be very helpful here. TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and MXNet call it `topk`.
When I saw `topk` I initially parsed it as "to pk", similar to the current `tolist`. I think `top_k` is more explicit and clear.
/David
On 29/05/2021 18:33, David Menéndez Hurtado wrote:
When I saw `topk` I initially parsed it as "to pk", similar to the current `tolist`. I think `top_k` is more explicit and clear.
What does k stand for here? As someone that never encountered this function before I find both names equally confusing. If I understand what the function is supposed to be doing, I think largest() would be much more descriptive.
Cheers, Dan
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi daniele@grinta.net wrote:
What does k stand for here? As someone that never encountered this function before I find both names equally confusing. If I understand what the function is supposed to be doing, I think largest() would be much more descriptive.
`k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function.
https://docs.python.org/3/library/heapq.html#heapq.nlargest
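For readers who have not used it, here is a small sketch of the stdlib function mentioned above. The data is made up for illustration; the second call shows one way to get indices rather than values, which is roughly what an "arg" variant would return.

```python
import heapq

scores = [0.1, 0.7, 0.05, 0.9, 0.3]  # illustrative data only

# The n largest values, returned largest first.
largest_three = heapq.nlargest(3, scores)

# The positions of the n largest values, obtained by ranking the
# indices with the scores as the key.
largest_three_idx = heapq.nlargest(3, range(len(scores)), key=scores.__getitem__)

print(largest_three)
print(largest_three_idx)
```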
"top-k" comes from the ML community where this function is used to evaluate classification models (`k` instead of `n` being largely an accident of history, I imagine). In many classification problems, the number of classes is very large, and they are very related to each other. For example, ImageNet has a lot of different dog breeds broken out as separate classes. In order to get a more balanced view of the relative performance of the classification models, you often want to check whether the correct class is in the top 5 classes (or whatever `k` is appropriate) that the model predicted for the example, not just the one class that the model says is the most likely. "5 largest" doesn't really work in the sentences that one usually writes when talking about ML classifiers; they are talking about the 5 classes that are associated with the 5 largest values from the predictor, not the values themselves. So "top k" is what gets used in ML discussions, and that transfers over to the name of the function in ML libraries.
It is a top-down reflection of the higher level thing that people want to compute (in that context) rather than a bottom-up description of how the function is manipulating the input, if that makes sense. Either one is a valid way to name things. There is a lot to be said for numpy's domain-agnostic nature that we should prefer the bottom-up description style of naming. However, we are also in the midst of a diversifying ecosystem of array libraries, largely driven by the ML domain, and adopting some of that terminology when we try to enhance our interoperability with those libraries is also a factor to be considered.
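To make the "is the correct class among the top 5" check concrete, here is a minimal sketch using existing NumPy functions. The helper name `top_k_accuracy` and the toy data are assumptions for illustration; this is not taken from the PR or from any ML library.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of rows whose true label is among the k highest scores.

    scores: (n_samples, n_classes) array of classifier outputs.
    labels: (n_samples,) array of integer class ids.
    Hypothetical helper; assumes 1 <= k <= n_classes.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_classes = scores.shape[1]
    # argpartition puts the indices of the k largest scores into the
    # last k positions of each row, in no particular order.
    top_classes = np.argpartition(scores, n_classes - k, axis=1)[:, n_classes - k:]
    # A row is a hit if its true label appears among those k classes.
    hits = (top_classes == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5]])
labels = np.array([2, 1, 2])
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

Note that only the class indices are needed here, not the score values, which matches the point that the ML usage is about the classes associated with the largest values.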
Since this is going into the top namespace, I'd also vote against the matlab-y "topk" name. And even matlab didn't do what I would expect and went with maxk
https://nl.mathworks.com/help/matlab/ref/maxk.html
I think "max_k" is a good generalization of the regular "max". Even when auto-completing, having this show up under max makes more sense to me than searching for it among the "t"s. Besides, "argmax_k" also follows suit with the existing convention. To my eyes this is an acceptable disturbance to an already very crowded namespace.
a few moments later....
But then again an ugly idea rears its head proposing this going into the existing max function. But I'll shut up now :)
After a coffee, I don't see the point of still calling it "k", so "max_n" is my vote, for what it's worth.
On Sun, May 30, 2021 at 8:38 AM Ilhan Polat ilhanpolat@gmail.com wrote:
Since this is going into the top namespace, I'd also vote against the matlab-y "topk" name. And even matlab didn't do what I would expect and went with maxk
https://nl.mathworks.com/help/matlab/ref/maxk.html
I think "max_k" is a good generalization of the regular "max". Even when auto-completing, having this show up under max makes more sense to me than searching for it among the "t"s. Besides, "argmax_k" also follows suit with the existing convention. To my eyes this is an acceptable disturbance to an already very crowded namespace.
a few moments later....
But then again an ugly idea rears its head proposing this going into the existing max function. But I'll shut up now :)
On 30/05/2021 00:48, Robert Kern wrote:
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi daniele@grinta.net wrote:
What does k stand for here? As someone that never encountered this function before I find both names equally confusing. If I understand what the function is supposed to be doing, I think largest() would be much more descriptive.
`k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function.
I thought that a `largest()` function with an integer second argument could be self-explanatory enough. `nlargest()` would be much more obvious to the wider audience, I think.
https://docs.python.org/3/library/heapq.html#heapq.nlargest
"top-k" comes from the ML community where this function is used to evaluate classification models (`k` instead of `n` being largely an accident of history, I imagine). In many classification problems, the number of classes is very large, and they are very related to each other. For example, ImageNet has a lot of different dog breeds broken out as separate classes. In order to get a more balanced view of the relative performance of the classification models, you often want to check whether the correct class is in the top 5 classes (or whatever `k` is appropriate) that the model predicted for the example, not just the one class that the model says is the most likely. "5 largest" doesn't really work in the sentences that one usually writes when talking about ML classifiers; they are talking about the 5 classes that are associated with the 5 largest values from the predictor, not the values themselves. So "top k" is what gets used in ML discussions, and that transfers over to the name of the function in ML libraries.
It is a top-down reflection of the higher level thing that people want to compute (in that context) rather than a bottom-up description of how the function is manipulating the input, if that makes sense. Either one is a valid way to name things. There is a lot to be said for numpy's domain-agnostic nature that we should prefer the bottom-up description style of naming. However, we are also in the midst of a diversifying ecosystem of array libraries, largely driven by the ML domain, and adopting some of that terminology when we try to enhance our interoperability with those libraries is also a factor to be considered.
I think that such a simple function should be named in the most obvious way possible, or it will become a function that is used in the domains where the unusual name makes sense but ends up being re-implemented in all other contexts. I am sure that if I had been looking for a function that returns the N largest items in an array (whether ordered according to a given key function or otherwise), I would never have looked at a function named `topk()` or `top_k()`, and I am pretty sure I would have discarded anything that has `k` or `top` in its name.
On the other hand, I understand that ML is where all the hype (and a large fraction of the money) is these days, so I understand if numpy wants to appease the crowd.
Cheers, Dan
Did this function come up at all in the array-API consortium discussions?
Matti
On Sun, May 30, 2021 at 10:01 AM Matti Picus matti.picus@gmail.com wrote:
Did this function come up at all in the array-API consortium discussions?
It happens to be in this list of functions which was made last week: https://github.com/data-apis/array-api/issues/187. That list is potential next candidates, based on them being implemented in most but not all libraries. There was no real discussion on `topk` specifically though.
The current version of the array API standard basically contains functionality that is either common to all libraries, or that NumPy has and most other libraries have as well. Given how much harder it is to get functions into NumPy than in other libraries, the "most libraries have it, NumPy does not" set of functions was not investigated much yet. That's also the reason NEP 47 doesn't have any new functions to be added to NumPy except for `from_dlpack`, but only consistency changes like adding keepdims keywords, stacking for linalg functions that are missing that, etc.
Cheers, Ralf
Hi, thanks for the reply. I present some details below:
## 1. complete signatures of the function in each of those libraries, and what the commonality is there.
| Library     | Name               | arg1  | arg2 | arg3 | arg4      | arg5   |
|-------------|--------------------|-------|------|------|-----------|--------|
| NumPy [1]   | numpy.topk         | a     | k    | axis | largest   | sorted |
| PyTorch [2] | torch.topk         | input | k    | dim  | largest   | sorted |
| R [3]       | topK               | x     | K    | /    | /         | /      |
| MXNet [4]   | mxnet.npx.topk     | data  | k    | axis | is_ascend | /      |
| CNTK [5]    | cntk.ops.top_k     | x     | k    | axis | /         | /      |
| TF [6]      | tf.math.top_k      | input | k    | /    | /         | sorted |
| Dask [7]    | dask.array.topk    | a     | k    | axis | -k        | /      |
| Dask [8]    | dask.array.argtopk | a     | k    | axis | -k        | /      |
| MATLAB [9]  | mink               | A     | k    | dim  | /         | /      |
| MATLAB [10] | maxk               | A     | k    | dim  | /         | /      |
| Library     | Name               | Returns               |
|-------------|--------------------|-----------------------|
| NumPy [1]   | numpy.topk         | values, indices       |
| PyTorch [2] | torch.topk         | values, indices       |
| R [3]       | topK               | indices               |
| MXNet [4]   | mxnet.npx.topk     | controlled by ret_typ |
| CNTK [5]    | cntk.ops.top_k     | values, indices       |
| TF [6]      | tf.math.top_k      | values, indices       |
| Dask [7]    | dask.array.topk    | values                |
| Dask [8]    | dask.array.argtopk | indices               |
| MATLAB [9]  | mink               | values, indices       |
| MATLAB [10] | maxk               | values, indices       |
- arg1: Input array.
- arg2: Number of top elements to look for along the given axis.
- arg3: Axis along which to find topk.
  - R only supports vectors; TensorFlow only supports axis=-1.
- arg4: Controls whether to return k largest or smallest elements.
  - R, CNTK and TensorFlow only return k largest elements.
  - In Dask, k can be negative, which means to return k smallest elements.
  - MATLAB uses two distinct functions.
- arg5: If true the resulting k elements will be sorted by the values.
  - R, MXNet, CNTK, Dask and MATLAB only return sorted elements.
**Summary**:
- Function Name: could be `topk`, `top_k`, `mink`/`maxk`.
- arg1 (a), arg2 (k), arg3 (axis): should be required.
- arg4 (largest), arg5 (sorted): might be discussed.
- Returns: discussed below.
## 2. the argument Eric made on your PR about consistency with sort/argsort, if we want topk/argtopk? Also, do other libraries have `argtopk`
In most libraries, `topk` or `top_k` returns both values and indices, and `argtopk` is not included except for Dask. In addition, there is another inconsistency: `sort` returns ascending values, but `topk` returns descending values.
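The ordering inconsistency mentioned above can be seen with existing NumPy functions alone. This is only an illustration with made-up data; it does not use the proposed function.

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # illustrative data only
k = 3

ascending = np.sort(a)                 # np.sort returns ascending order
largest_k_asc = ascending[-k:]         # the k largest, still ascending
largest_k_desc = largest_k_asc[::-1]   # largest first, the order topk-style functions return

print(largest_k_asc)
print(largest_k_desc)
```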
## Suggestions
Finally, IMHO, new function signature might be designed as one of:

I) use `topk` / `argtopk` or `top_k` / `argtop_k`
```python
def topk(a, k, axis=-1, sorted=True) -> topk_values
def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
```
or
```python
def top_k(a, k, axis=-1, sorted=True) -> topk_values
def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
```
where `k` can be negative which means to return k smallest elements.

II) use `maxk` / `argmaxk` or `max_k` / `argmax_k` (`mink` / `argmink` or `min_k` / `argmin_k`)
```python
def maxk(a, k, axis=-1, sorted=True) -> values
def argmaxk(a, k, axis=-1, sorted=True) -> indices

def mink(a, k, axis=-1, sorted=True) -> values
def argmink(a, k, axis=-1, sorted=True) -> indices
```
or
```python
def max_k(a, k, axis=-1, sorted=True) -> values
def argmax_k(a, k, axis=-1, sorted=True) -> indices

def min_k(a, k, axis=-1, sorted=True) -> values
def argmin_k(a, k, axis=-1, sorted=True) -> indices
```
where `k` must be positive.
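As a rough illustration of option I, the following sketch builds `top_k` / `argtop_k` on top of `np.argpartition` and `np.take_along_axis`. It is an assumption about how such functions could behave (including the signed-`k` convention and largest-first ordering for positive `k`), not the implementation in the PR.

```python
import numpy as np

def argtop_k(a, k, axis=-1, sorted=True):
    """Indices of the k largest (k > 0) or |k| smallest (k < 0) entries.

    Hypothetical sketch of option I; not the code from the PR.
    """
    a = np.asarray(a)
    size = a.shape[axis]
    n = abs(k)
    if n == 0 or n > size:
        raise ValueError("abs(k) must be between 1 and the axis length")
    if k > 0:
        # After partitioning, the n largest entries occupy the last n slots.
        part = np.argpartition(a, size - n, axis=axis)
        idx = np.take(part, np.arange(size - n, size), axis=axis)
    else:
        # After partitioning, the n smallest entries occupy the first n slots.
        part = np.argpartition(a, n - 1, axis=axis)
        idx = np.take(part, np.arange(n), axis=axis)
    if sorted:
        vals = np.take_along_axis(a, idx, axis=axis)
        order = np.argsort(vals, axis=axis)
        if k > 0:
            order = np.flip(order, axis=axis)  # largest first
        idx = np.take_along_axis(idx, order, axis=axis)
    return idx

def top_k(a, k, axis=-1, sorted=True):
    """Values corresponding to argtop_k (same hypothetical conventions)."""
    a = np.asarray(a)
    return np.take_along_axis(a, argtop_k(a, k, axis, sorted), axis=axis)

a = np.array([[3, 1, 4, 1, 5],
              [9, 2, 6, 5, 3]])
print(top_k(a, 2))
print(argtop_k(a, 2))
print(top_k(a, -2))
```

The partition step avoids a full sort; only the selected `k` entries are sorted afterwards, and only when `sorted=True`.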
**References**:
- [1] https://github.com/numpy/numpy/pull/19117
- [2] https://pytorch.org/docs/stable/generated/torch.topk.html
- [3] https://www.rdocumentation.org/packages/tensr/versions/1.0.1/topics/topK
- [4] https://mxnet.apache.org/versions/master/api/python/docs/api/npx/generated/m...
- [5] https://docs.microsoft.com/en-us/python/api/cntk/cntk.ops?view=cntk-py-2.7#t...
- [6] https://tensorflow.google.cn/api_docs/python/tf/math/top_k?hl=zh-cn
- [7] https://docs.dask.org/en/latest/array-api.html?highlight=topk#dask.array.top...
- [8] https://docs.dask.org/en/latest/array-api.html?highlight=topk#dask.array.arg...
- [9] https://nl.mathworks.com/help/matlab/ref/mink.html
- [10] https://nl.mathworks.com/help/matlab/ref/maxk.html
Mathematica and Julia both seem relevant here. Mma has TakeLargest (and Wolfram tends to think hard about names).
https://reference.wolfram.com/language/ref/TakeLargest.html
Julia's closest comparable is perhaps partialsortperm:
https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
Alan Isaac
On 5/30/2021 4:40 AM, kangkai@mail.ustc.edu.cn wrote:
Hi, Thanks for reply, I present some details below:
Topk is a bad choice imo. I initially parsed it as to_pk, and had no idea what that was, although it sounded a lot like a scipy signal function. Nlargest would be very obvious.
On Sun, May 30, 2021, 7:50 AM Alan G. Isaac alan.isaac@gmail.com wrote:
Mathematica and Julia both seem relevant here. Mma has TakeLargest (and Wolfram tends to think hard about names). https://reference.wolfram.com/language/ref/TakeLargest.html Julia's closest comparable is perhaps partialsortperm: https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm Alan Isaac
To be honest, I read "topk" as "topeka", but I am weird. While numpy doesn't use underscores all that much, I think this is one case where it makes sense.
I'd also watch out for the use of the term "sorted", as it may mean different things to different people, particularly with regards to what its default value should be. I also find myself initially confused by the names "largest" and "sorted", especially what they should mean with the "min-k" behavior. I think Dask's use of negative k is very pythonic and would help keep the namespace clean by avoiding the extra "min_k".
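To illustrate why "sorted" can be read in more than one way, here is a small sketch using only existing NumPy functions and made-up data. The three results contain the same values and differ only in order.

```python
import numpy as np

a = np.array([7, 2, 9, 4, 8, 1])  # illustrative data only
k = 3

# Not sorted: the k largest values in whatever order partition leaves them.
unsorted_top = np.partition(a, -k)[-k:]

# "Sorted", reading 1: largest first.
sorted_desc = np.sort(unsorted_top)[::-1]

# "Sorted", reading 2: ascending, consistent with np.sort.
sorted_asc = np.sort(unsorted_top)

print(unsorted_top, sorted_desc, sorted_asc)
```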
As for the indices, I am of two minds. On the one hand, I don't like polluting the namespace with extra functions. On the other hand, having a function that behaves differently based on a parameter is just fugly, although we do have a function that does this - np.unique().
Ben Root
On Sun, May 30, 2021 at 8:22 AM Neal Becker ndbecker2@gmail.com wrote:
Topk is a bad choice imo. I initially parsed it as to_pk, and had no idea what that was, although sounded a lot like a scipy signal function. Nlargest would be very obvious.
Here's my opinion, as a bit of an outsider. Mainly, I understand MAX to mean the largest value in a finite totally ordered set. I understand TOP to mean the 'best' member of a finite set.
For example, on a mountain each point has a HEIGHT. There will be a MAX HEIGHT. The point(s) on the mountain that is the highest is the SUMMIT. Or in other words the TOP of the mountain. Or another example, there are TOP 40 charts for music. https://www.officialcharts.com/
To summarize, use MAX for the largest value in a totally ordered set. Use TOP when you have a height (or similar) function applied to an unordered set. The highest temperature in 2021 will occur on the hottest day(s). One is a temperature, the other a date.
I'm an outsider, and I've not made an effort to gain special knowledge about the domain prior to posting this opinion. I hope it helps. Please ignore it if it does not.
On Sun, May 30, 2021 at 10:41 AM kangkai@mail.ustc.edu.cn wrote:
Hi, Thanks for reply, I present some details below:
Thanks for the detailed investigation Kang!
## Suggestions
Finally, IMHO, new function signature might be designed as one of:
I) use `topk` / `argtopk` or `top_k` / `argtop_k`
def topk(a, k, axis=-1, sorted=True) -> topk_values
def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
or
def top_k(a, k, axis=-1, sorted=True) -> topk_values
def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
where `k` can be negative which means to return k smallest elements.
I don't think I'm a fan of the `-k` cleverness. Saying you want `-5` values as a stand-in for wanting the 5 smallest values is worse than a keyword imho.
It seems like commenters so far have a preference for `top_k` over `topk`, because of readability. Either way it's going to impact Dask, JAX, etc. - so it would be nice to get some input from maintainers of those libraries.
The two functions vs. returning `(values, indices)` is also a tricky choice - it may depend on usage patterns. If one needs indices a lot, then there's something to say for the tuple return. Otherwise the code is going to look like:
indices = argtop_k(x, ....)
values = x[indices]
which is significantly worse than:
values, indices = top_k(x, ...)
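One detail worth noting about the two-step form: for arrays with more than one dimension, plain `x[indices]` does not pair each index with its own row, so the second step would likely need `np.take_along_axis`. The sketch below uses `np.argsort` as a stand-in for the hypothetical `argtop_k`, with made-up data.

```python
import numpy as np

x = np.array([[3, 1, 4],
              [1, 5, 9],
              [2, 6, 5]])  # illustrative data only

# Stand-in for argtop_k(x, 2): indices of the 2 largest entries per row,
# largest first.
indices = np.argsort(x, axis=-1)[:, ::-1][:, :2]

# Plain fancy indexing treats the indices as row numbers and returns
# whole rows, giving shape (3, 2, 3) instead of per-row values.
wrong = x[indices]

# take_along_axis pairs each index with its own row.
values = np.take_along_axis(x, indices, axis=-1)

print(wrong.shape)
print(values)
```

This extra step is one more argument for considering the `(values, indices)` tuple return when indices are needed often.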
II) use `maxk` / `argmaxk` or `max_k` / `argmax_k` (`mink` / `argmink` or `min_k` / `argmin_k`)
I suggest to forget about maxk/max_k. All Python libraries call it topk/top_k. And Matlab choosing something is usually a good reason to run in the other direction.
Cheers, Ralf