Mailman 3 EHN: Discusions about 'add numpy.topk' - NumPy-Discussion

EHN: Discusions about 'add numpy.topk'

kangkai＠mail.ustc.edu.cn

28 May 2021

7:57 a.m.

Hi all,

Finding the top-k elements is widely used in several fields, but is missing from NumPy. I implemented this functionality, named numpy.topk, using core NumPy functions and opened a PR:

https://github.com/numpy/numpy/pull/19117

Any discussion is welcome.

Best wishes,
Kang Kai


Ralf Gommers

29 May

7:28 a.m.


Thanks for the proposal Kang. I think this functionality is indeed a fairly obvious gap in what Numpy offers, and would make sense to add.

A detailed comparison with other libraries would be very helpful here. TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and MXNet call it `topk`.

Two things to look at in more detail here are:
1. complete signatures of the function in each of those libraries, and what the commonality is there.
2. the argument Eric made on your PR about consistency with sort/argsort, and if we want topk/argtopk? Also, do other libraries have `argtopk`?

Cheers,
Ralf

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

David Menéndez Hurtado

9:33 a.m.


When I saw `topk` I initially parsed it as "to pk", similar to the current `tolist`. I think `top_k` is more explicit and clear. /David

Daniele Nicolodi

12:26 p.m.


What does k stand for here? As someone that never encountered this function before I find both names equally confusing. If I understand what the function is supposed to be doing, I think largest() would be much more descriptive. Cheers, Dan

Robert Kern

3:48 p.m.

On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <daniele@grinta.net> wrote:

...

What does k stand for here? As someone that never encountered this function before I find both names equally confusing. If I understand what the function is supposed to be doing, I think largest() would be much more descriptive.

`k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function.

https://docs.python.org/3/library/heapq.html#heapq.nlargest

"top-k" comes from the ML community where this function is used to evaluate classification models (`k` instead of `n` being largely an accident of history, I imagine). In many classification problems, the number of classes is very large, and they are very related to each other. For example, ImageNet has a lot of different dog breeds broken out as separate classes. In order to get a more balanced view of the relative performance of the classification models, you often want to check whether the correct class is in the top 5 classes (or whatever `k` is appropriate) that the model predicted for the example, not just the one class that the model says is the most likely. "5 largest" doesn't really work in the sentences that one usually writes when talking about ML classifiers; they are talking about the 5 classes that are associated with the 5 largest values from the predictor, not the values themselves. So "top k" is what gets used in ML discussions, and that transfers over to the name of the function in ML libraries.

It is a top-down reflection of the higher level thing that people want to compute (in that context) rather than a bottom-up description of how the function is manipulating the input, if that makes sense. Either one is a valid way to name things. There is a lot to be said for numpy's domain-agnostic nature that we should prefer the bottom-up description style of naming. However, we are also in the midst of a diversifying ecosystem of array libraries, largely driven by the ML domain, and adopting some of that terminology when we try to enhance our interoperability with those libraries is also a factor to be considered.

--
Robert Kern
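The top-k evaluation described above can be sketched with existing NumPy functions. This is only an illustration, not code from the thread or the PR: the helper name `top_k_accuracy` and the toy scores and labels are made up, and `np.argpartition` is used as a stand-in for the proposed function.

```python
import numpy as np


def top_k_accuracy(scores, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; `scores` has shape
    (n_samples, n_classes) and `labels` has shape (n_samples,).
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argpartition leaves the indices of the k largest scores in the
    # last k columns of each row, in no particular order.
    top_k_classes = np.argpartition(scores, -k, axis=1)[:, -k:]
    # A row counts as a hit if any of its k candidate classes is the label.
    hits = (top_k_classes == labels[:, None]).any(axis=1)
    return hits.mean()


# Toy data: 3 examples, 4 classes.
scores = np.array([
    [0.10, 0.50, 0.30, 0.10],
    [0.70, 0.05, 0.05, 0.20],
    [0.25, 0.15, 0.20, 0.40],
])
labels = np.array([2, 3, 1])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Note that the quantity of interest here is the set of class indices, not the score values, which is the distinction between "top k" and "k largest" drawn above.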

Ilhan Polat

11:38 p.m.

Since this is going into the top namespace, I'd also vote against the matlab-y "topk" name. And even matlab didn't do what I would expect and went with maxk

https://nl.mathworks.com/help/matlab/ref/maxk.html

I think "max_k" is a good generalization of the regular "max". Even when auto-completing, this showing up under max makes sense to me instead of searching for it among the "t"s. Besides, "argmax_k" also follows suit with the previous convention. To my eyes this is an acceptable disturbance to an already very crowded namespace.

a few moments later....

But then again an ugly idea rears its head proposing this going into the existing max function. But I'll shut up now :)

Ilhan Polat

30 May

12:10 a.m.

After a coffee, I don't see the point of still calling it "k", so "max_n" is my vote, for what it's worth.

Alan G. Isaac

4:23 a.m.

Is there any thought of allowing for other comparisons? In which case `last_k` might be preferable. Alan Isaac On 5/30/2021 2:38 AM, Ilhan Polat wrote:

...

I think "max_k" is a good generalization of the regular "max".

Daniele Nicolodi

1:10 a.m.

On 30/05/2021 00:48, Robert Kern wrote:

...

On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <daniele@grinta.net> wrote:

What does k stand for here? As someone that never encountered this function before I find both names equally confusing. If I understand what the function is supposed to be doing, I think largest() would be much more descriptive.

`k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function.

I thought that a `largest()` function with an integer second argument could be self-explanatory enough. `nlargest()` would be much more obvious to the wider audience, I think.


I think that such a simple function should be named in the most obvious way possible, or it will become one function that will be used in the domains where the unusual name makes sense, but will end up being re-implemented in all other contexts. I am sure that if I had been looking for a function that returns the N largest items in an array (whether according to a given key function or otherwise) I would never have looked at a function named `topk()` or `top_k()`, and I am pretty sure I would have discarded anything that has `k` or `top` in its name.

On the other hand, I understand that ML is where all the hype (and a large fraction of the money) is these days, thus I understand if numpy wants to appease the crowd.

Cheers,
Dan
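For reference, the stdlib `heapq.nlargest()` mentioned in this exchange already covers the "N largest items, optionally by a key function" case for plain Python iterables. A minimal example follows; the sample data is made up for illustration.

```python
import heapq

values = [3, 9, 1, 7, 5]

# The three largest values, returned in descending order.
three_largest = heapq.nlargest(3, values)

# The same function accepts a key function, here the string length.
words = ["fig", "kiwi", "banana", "apple"]
two_longest = heapq.nlargest(2, words, key=len)
```

`heapq.nlargest()` works element by element on any iterable, so it has no `axis` argument and does not return indices, which are the parts the NumPy proposal would add.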

Matti Picus

1 a.m.


Did this function come up at all in the array-API consortium discussions?

Matti

Ralf Gommers

31 May

9:26 a.m.

On Sun, May 30, 2021 at 10:01 AM Matti Picus <matti.picus@gmail.com> wrote:

...

Did this function come up at all in the array-API consortium discussions?

It happens to be in this list of functions which was made last week: https://github.com/data-apis/array-api/issues/187. That list contains potential next candidates, based on them being implemented in most but not all libraries. There was no real discussion on `topk` specifically though.

The current version of the array API standard basically contains functionality that is either common to all libraries, or that NumPy has and most other libraries have as well. Given how much harder it is to get functions into NumPy than into other libraries, the "most libraries have it, NumPy does not" set of functions has not been investigated much yet. That's also the reason NEP 47 doesn't have any new functions to be added to NumPy except for `from_dlpack`, but only consistency changes like adding keepdims keywords, stacking for linalg functions that are missing that, etc.

Cheers,
Ralf

kangkai＠mail.ustc.edu.cn

30 May

1:40 a.m.


Hi,

Thanks for the reply; I present some details below:

## 1. complete signatures of the function in each of those libraries, and what the commonality is there.

| Library     | Name               | arg1  | arg2 | arg3 | arg4      | arg5   |
|-------------|--------------------|-------|------|------|-----------|--------|
| NumPy [1]   | numpy.topk         | a     | k    | axis | largest   | sorted |
| PyTorch [2] | torch.topk         | input | k    | dim  | largest   | sorted |
| R [3]       | topK               | x     | K    | /    | /         | /      |
| MXNet [4]   | mxnet.npx.topk     | data  | k    | axis | is_ascend | /      |
| CNTK [5]    | cntk.ops.top_k     | x     | k    | axis | /         | /      |
| TF [6]      | tf.math.top_k      | input | k    | /    | /         | sorted |
| Dask [7]    | dask.array.topk    | a     | k    | axis | -k        | /      |
| Dask [8]    | dask.array.argtopk | a     | k    | axis | -k        | /      |
| MATLAB [9]  | mink               | A     | k    | dim  | /         | /      |
| MATLAB [10] | maxk               | A     | k    | dim  | /         | /      |

| Library     | Name               | Returns                  |
|-------------|--------------------|--------------------------|
| NumPy [1]   | numpy.topk         | values, indices          |
| PyTorch [2] | torch.topk         | values, indices          |
| R [3]       | topK               | indices                  |
| MXNet [4]   | mxnet.npx.topk     | controlled by `ret_typ`  |
| CNTK [5]    | cntk.ops.top_k     | values, indices          |
| TF [6]      | tf.math.top_k      | values, indices          |
| Dask [7]    | dask.array.topk    | values                   |
| Dask [8]    | dask.array.argtopk | indices                  |
| MATLAB [9]  | mink               | values, indices          |
| MATLAB [10] | maxk               | values, indices          |

- arg1: Input array.
- arg2: Number of top elements to look for along the given axis.
- arg3: Axis along which to find topk.
  - R only supports vectors; TensorFlow only supports axis=-1.
- arg4: Controls whether to return the k largest or smallest elements.
  - R, CNTK and TensorFlow only return the k largest elements.
  - In Dask, k can be negative, which means to return the k smallest elements.
  - MATLAB uses two distinct functions.
- arg5: If true, the resulting k elements will be sorted by value.
  - R, MXNet, CNTK, Dask and MATLAB only return sorted elements.

**Summary**:

- Function name: could be `topk`, `top_k`, `mink`/`maxk`.
- arg1 (a), arg2 (k), arg3 (axis): should be required.
- arg4 (largest), arg5 (sorted): might be discussed.
- Returns: discussed below.

## 2. the argument Eric made on your PR about consistency with sort/argsort, if we want topk/argtopk? Also, do other libraries have `argtopk`

In most libraries, `topk` or `top_k` returns both values and indices, and `argtopk` is not included except for Dask. In addition, there is another inconsistency: `sort` returns ascending values, but `topk` returns descending values.

## Suggestions

Finally, IMHO, the new function signature might be designed as one of:

I) use `topk` / `argtopk` or `top_k` / `argtop_k`

```python
def topk(a, k, axis=-1, sorted=True) -> topk_values
def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
```

or

```python
def top_k(a, k, axis=-1, sorted=True) -> topk_values
def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
```

where `k` can be negative, which means to return the k smallest elements.

II) use `maxk` / `argmaxk` or `max_k` / `argmax_k` (`mink` / `argmink` or `min_k` / `argmin_k`)

```python
def maxk(a, k, axis=-1, sorted=True) -> values
def argmaxk(a, k, axis=-1, sorted=True) -> indices

def mink(a, k, axis=-1, sorted=True) -> values
def argmink(a, k, axis=-1, sorted=True) -> indices
```

or

```python
def max_k(a, k, axis=-1, sorted=True) -> values
def argmax_k(a, k, axis=-1, sorted=True) -> indices

def min_k(a, k, axis=-1, sorted=True) -> values
def argmin_k(a, k, axis=-1, sorted=True) -> indices
```

where `k` must be positive.

**References**:

- [1] https://github.com/numpy/numpy/pull/19117
- [2] https://pytorch.org/docs/stable/generated/torch.topk.html
- [3] https://www.rdocumentation.org/packages/tensr/versions/1.0.1/topics/topK
- [4] https://mxnet.apache.org/versions/master/api/python/docs/api/npx/generated/m...
- [5] https://docs.microsoft.com/en-us/python/api/cntk/cntk.ops?view=cntk-py-2.7#t...
- [6] https://tensorflow.google.cn/api_docs/python/tf/math/top_k?hl=zh-cn
- [7] https://docs.dask.org/en/latest/array-api.html?highlight=topk#dask.array.top...
- [8] https://docs.dask.org/en/latest/array-api.html?highlight=topk#dask.array.arg...
- [9] https://nl.mathworks.com/help/matlab/ref/mink.html
- [10] https://nl.mathworks.com/help/matlab/ref/maxk.html
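To make the compared signatures concrete, here is a rough sketch of how a function with the `(a, k, axis, largest, sorted)` parameters from the first table row could be built from existing NumPy functions. It is an assumption-laden illustration, not the implementation in the PR: the name `top_k`, the error handling, and the choice of `np.argpartition` plus `np.take_along_axis` are all choices made here.

```python
import numpy as np


def top_k(a, k, axis=-1, largest=True, sorted=True):
    """Hypothetical sketch: (values, indices) of the k extreme elements along `axis`.

    The `sorted` keyword mirrors the proposal and shadows the builtin
    inside this function only.
    """
    a = np.asarray(a)
    n = a.shape[axis]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the length of the axis")

    if largest:
        # After partitioning at n - k, the last k slots hold the k largest.
        part = np.argpartition(a, n - k, axis=axis)
        idx = np.take(part, np.arange(n - k, n), axis=axis)
    else:
        # After partitioning at k - 1, the first k slots hold the k smallest.
        part = np.argpartition(a, k - 1, axis=axis)
        idx = np.take(part, np.arange(k), axis=axis)
    vals = np.take_along_axis(a, idx, axis=axis)

    if sorted:
        # Order the k selected elements: descending for largest,
        # ascending for smallest.
        order = np.argsort(vals, axis=axis)
        if largest:
            order = np.flip(order, axis=axis)
        idx = np.take_along_axis(idx, order, axis=axis)
        vals = np.take_along_axis(vals, order, axis=axis)
    return vals, idx


a = np.array([[1, 5, 3, 4],
              [9, 2, 8, 7]])
values, indices = top_k(a, 2)                       # two largest per row
small_values, small_indices = top_k(a, 2, largest=False)
```

Partitioning first and sorting only the k selected elements avoids a full sort of each slice, which is the usual reason to offer this operation separately from `sort`/`argsort`.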

Alan G. Isaac

4:50 a.m.

Mathematica and Julia both seem relevant here. Mma has TakeLargest (and Wolfram tends to think hard about names).
https://reference.wolfram.com/language/ref/TakeLargest.html

Julia's closest comparable is perhaps partialsortperm:
https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm

Alan Isaac

On 5/30/2021 4:40 AM, kangkai@mail.ustc.edu.cn wrote:

...

Hi, Thanks for reply, I present some details below:

Neal Becker

5:22 a.m.

Topk is a bad choice imo. I initially parsed it as to_pk, and had no idea what that was, although it sounded a lot like a scipy signal function.

Nlargest would be very obvious.

Benjamin Root

8:31 p.m.

To be honest, I read "topk" as "topeka", but I am weird. While numpy doesn't use underscores all that much, I think this is one case where it makes sense.

I'd also watch out for the use of the term "sorted", as it may mean different things to different people, particularly with regards to what its default value should be. I also find myself initially confused by the names "largest" and "sorted", especially what they should mean with the "min-k" behavior. I think Dask's use of negative k is very pythonic and would help keep the namespace clean by avoiding the extra "min_k".

As for the indices, I am of two minds. On the one hand, I don't like polluting the namespace with extra functions. On the other hand, having a function that behaves differently based on a parameter is just fugly, although we do have a function that does this - np.unique().

Ben Root
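The `np.unique()` precedent mentioned in this message looks like the following in practice. The sample array is made up; the `return_index` keyword is part of NumPy's existing API.

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])

# Default call: a single array of sorted unique values.
unique_values = np.unique(a)

# With return_index=True the return type changes to a tuple:
# (unique values, index of the first occurrence of each value).
unique_values_2, first_index = np.unique(a, return_index=True)
```

A `top_k` with an optional "also return indices" flag would have the same property: callers must know which keywords were passed in order to know what shape of result to unpack.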

Jonathan Fine

31 May

8:10 a.m.

Here's my opinion, as a bit of an outsider.

Mainly, I understand MAX to mean the largest value in a finite totally ordered set. I understand TOP to mean the 'best' member of a finite set.

For example, on a mountain each point has a HEIGHT. There will be a MAX HEIGHT. The point(s) on the mountain that is the highest is the SUMMIT. Or in other words the TOP of the mountain.

Or another example, there are TOP 40 charts for music.
https://www.officialcharts.com/

To summarize, use MAX for the largest value in a totally ordered set. Use TOP when you have a height (or similar) function applied to an unordered set. The highest temperature in 2021 will occur on the hottest day(s). One is a temperature, the other a date.

I'm an outsider, and I've not made an effort to gain special knowledge about the domain prior to posting this opinion. I hope it helps. Please ignore it if it does not.

--
Jonathan

Ralf Gommers

9:49 a.m.

On Sun, May 30, 2021 at 10:41 AM <kangkai@mail.ustc.edu.cn> wrote:


Hi, Thanks for reply, I present some details below:

Thanks for the detailed investigation Kang!


## Suggestions

Finally, IMHO, new function signature might be designed as one of:

I) use `topk` / `argtopk` or `top_k` / `argtop_k`

```python
def topk(a, k, axis=-1, sorted=True) -> topk_values
def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
```

or

```python
def top_k(a, k, axis=-1, sorted=True) -> topk_values
def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
```

where `k` can be negative which means to return k smallest elements.

I don't think I'm a fan of the `-k` cleverness. Saying you want `-5` values as a stand-in for wanting the 5 smallest values is worse than a keyword imho.

It seems like commenters so far have a preference for `top_k` over `topk`, because of readability. Either way it's going to impact Dask, JAX, etc. - so it would be nice to get some input from maintainers of those libraries.

The two functions vs. returning `(values, indices)` is also a tricky choice - it may depend on usage patterns. If one needs indices a lot, then there's something to say for the tuple return. Otherwise the code is going to look like:

    indices = argtop_k(x, ....)
    values = x[indices]

which is significantly worse than:

    values, indices = top_k(x, ...)
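The two-call pattern above can be tried today with existing NumPy functions. In this sketch a reversed, truncated `np.argsort` stands in for the hypothetical `argtop_k`; the array and `k` are made-up sample values.

```python
import numpy as np

x = np.array([[1, 5, 3, 4],
              [9, 2, 8, 7]])
k = 2

# Stand-in for a hypothetical argtop_k: full argsort along the last axis,
# reversed to descending order, then truncated to the first k columns.
indices = np.argsort(x, axis=-1)[:, ::-1][:, :k]

# For 1-D input, `x[indices]` is enough. For N-D input, plain fancy
# indexing with `indices` would select along the first axis instead,
# so the gather has to go through take_along_axis.
values = np.take_along_axis(x, indices, axis=-1)
```

So for multi-dimensional input the second step is a little more involved than `x[indices]`, which arguably strengthens the case for a single call that returns both arrays.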

...

II) use `maxk` / `argmaxk` or `max_k` / `argmax_k` (`mink` / `argmink` or `min_k` / `argmin_k`)

I suggest forgetting about maxk/max_k. All Python libraries call it topk/top_k. And Matlab choosing something is usually a good reason to run in the other direction.

Cheers,
Ralf


16 comments

11 participants

participants (11)

Alan G. Isaac
Benjamin Root
Daniele Nicolodi
David Menéndez Hurtado
Ilhan Polat
Jonathan Fine
kangkai＠mail.ustc.edu.cn
Matti Picus
Neal Becker
Ralf Gommers
Robert Kern
