[Numpy-discussion] ENH: Discussions about 'add numpy.topk'

Ilhan Polat ilhanpolat at gmail.com
Sun May 30 03:10:01 EDT 2021


After a coffee, I don't see the point of still calling it "k", so "max_n" is
my vote, for what it's worth.

On Sun, May 30, 2021 at 8:38 AM Ilhan Polat <ilhanpolat at gmail.com> wrote:

> Since this is going into the top-level namespace, I'd also vote against the
> MATLAB-y "topk" name. Even MATLAB didn't do what I would expect and went
> with maxk:
>
> https://nl.mathworks.com/help/matlab/ref/maxk.html
>
> I think "max_k" is a good generalization of the regular "max". Even when
> auto-completing, having this show up next to "max" makes more sense to me
> than searching for it among the "t"s. Besides, "argmax_k" then follows suit,
> matching the existing max/argmax convention. To my eyes this is an
> acceptable disturbance to an already very crowded namespace.
>
>
>
> a few moments later....
>
> But then again, an ugly idea rears its head: putting this into the
> existing max function. But I'll shut up now :)
>
> On Sun, May 30, 2021 at 12:50 AM Robert Kern <robert.kern at gmail.com>
> wrote:
>
>> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <daniele at grinta.net>
>> wrote:
>>
>>> What does k stand for here? As someone who has never encountered this
>>> function before, I find both names equally confusing. If I understand
>>> what the function is supposed to be doing, I think largest() would be
>>> much more descriptive.
>>>
>>
>> `k` is the number of elements to return. `largest()` can connote that
>> it's only returning the one largest value. It's fairly typical to include a
>> dummy variable (`k` or `n`) in the name to indicate that the function lets
>> you specify how many you want. See, for example, the stdlib `heapq`
>> module's `nlargest()` function.
>>
>> https://docs.python.org/3/library/heapq.html#heapq.nlargest
>>
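
For readers who have not used it, here is a minimal sketch of the stdlib function mentioned above. The sample data and variable names are made up for illustration. The second call, which ranks indices rather than values via `key`, is just one way to get the "arg" flavour and is not something proposed in this thread.

```python
import heapq

# Made-up sample data, purely for illustration.
scores = [0.1, 0.7, 0.2, 0.9, 0.4]
k = 3

# The k largest values themselves.
largest_values = heapq.nlargest(k, scores)

# The positions of those values: rank the indices by the score they point at.
largest_indices = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

print(largest_values)   # [0.9, 0.7, 0.4]
print(largest_indices)  # [3, 1, 4]
```

Note that both results come back ordered from largest to smallest.
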
>> "top-k" comes from the ML community where this function is used to
>> evaluate classification models (`k` instead of `n` being largely an
>> accident of history, I imagine). In many classification problems, the
>> number of classes is very large, and many classes are closely related.
>> For example, ImageNet has a lot of different dog breeds broken out as
>> separate classes. In order to get a more balanced view of the relative
>> performance of the classification models, you often want to check whether
>> the correct class is in the top 5 classes (or whatever `k` is appropriate)
>> that the model predicted for the example, not just the one class that the
>> model says is the most likely. "5 largest" doesn't really work in the
>> sentences that one usually writes when talking about ML classifiers; those
>> sentences are about the 5 classes associated with the 5 largest values
>> from the predictor, not the values themselves. So "top k" is what
>> gets used in ML discussions, and that transfers over to the name of the
>> function in ML libraries.
>>
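
As a rough sketch of the top-k evaluation described above, assuming a 2-D array of per-class scores (one row per example) and a 1-D array of true labels. The function name `top_k_accuracy`, its parameters, and the toy data are hypothetical. `np.argpartition` is used here only as one existing way to pick the k largest entries; it is not the proposed `topk` API.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; assumes 1 <= k <= number of classes.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argpartition puts the indices of the k largest scores of each row into
    # the last k columns (in no particular order), without a full sort.
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # An example counts as a hit if its true label appears among those indices.
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy data: 3 examples, 4 classes.
scores = np.array([
    [0.10, 0.50, 0.30, 0.10],  # true class 2 has the second-highest score
    [0.70, 0.10, 0.15, 0.05],  # true class 0 has the highest score
    [0.20, 0.30, 0.40, 0.10],  # true class 3 has the lowest score
])
labels = np.array([2, 0, 3])

top1 = top_k_accuracy(scores, labels, k=1)  # only the second example is a hit
top2 = top_k_accuracy(scores, labels, k=2)  # the first two examples are hits
```

What matters here is the indices (the classes), not the score values, which is the distinction drawn above between "top k" and "k largest".
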
>> It is a top-down reflection of the higher level thing that people want to
>> compute (in that context) rather than a bottom-up description of how the
>> function is manipulating the input, if that makes sense. Either one is a
>> valid way to name things. There is a lot to be said for the argument that
>> numpy's domain-agnostic nature means we should prefer the bottom-up,
>> descriptive style of naming. However, we are also in the midst of a
>> diversifying ecosystem of array libraries, largely driven by the ML domain,
>> and adopting some of that terminology when we try to enhance our
>> interoperability with those libraries is also a factor to be considered.
>>
>> --
>> Robert Kern
>