Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

30 May 2021

      On 30/05/2021 00:48, Robert Kern wrote:
...
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <daniele@grinta.net> wrote:
What does `k` stand for here? As someone who has never encountered this
function before, I find both names equally confusing. If I understand
what the function is supposed to be doing, I think `largest()` would be
much more descriptive.
`k` is the number of elements to return. `largest()` can connote that
it's only returning the one largest value. It's fairly typical to
include a dummy variable (`k` or `n`) in the name to indicate that the
function lets you specify how many you want. See, for example, the
stdlib `heapq` module's `nlargest()` function.
I thought that a `largest()` function with an integer second argument
could be self-explanatory enough. `nlargest()` would be much more
obvious to the wider audience, I think.
...
https://docs.python.org/3/library/heapq.html#heapq.nlargest
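For readers who have not used the stdlib function mentioned above, here is a minimal sketch of `heapq.nlargest()`. The `scores` list is made-up illustration data, not something from this thread.

```python
import heapq

# Made-up example data (an assumption for illustration only).
scores = [0.1, 0.7, 0.05, 0.9, 0.3]

# The first argument is the "n" in the name: how many items to return.
# The result is ordered from largest to smallest.
top3 = heapq.nlargest(3, scores)

# With a key function, the same call can return the *positions* of the
# three largest values instead of the values themselves.
top3_idx = heapq.nlargest(3, range(len(scores)), key=scores.__getitem__)

print(top3)      # [0.9, 0.7, 0.3]
print(top3_idx)  # [3, 1, 4]
```

The second call is the closer analogue of what the ML usage asks for: which items rank highest, rather than what their values are.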
"top-k" comes from the ML community where this function is used to
evaluate classification models (`k` instead of `n` being largely an
accident of history, I imagine). In many classification problems, the
number of classes is very large, and they are very related to each
other. For example, ImageNet has a lot of different dog breeds broken
out as separate classes. In order to get a more balanced view of the
relative performance of the classification models, you often want to
check whether the correct class is in the top 5 classes (or whatever `k`
is appropriate) that the model predicted for the example, not just the
one class that the model says is the most likely. "5 largest" doesn't
really work in the sentences that one usually writes when talking about
ML classifiers; they are talking about the 5 classes that are associated
with the 5 largest values from the predictor, not the values themselves.
So "top k" is what gets used in ML discussions, and that transfers over
to the name of the function in ML libraries.
It is a top-down reflection of the higher level thing that people want
to compute (in that context) rather than a bottom-up description of how
the function is manipulating the input, if that makes sense. Either one
is a valid way to name things. Given numpy's domain-agnostic nature,
there is a lot to be said for preferring the bottom-up descriptive
style of naming. However, we are also in the midst of a diversifying
ecosystem of array libraries, largely driven by the ML domain, and
adopting some of that terminology when we try to enhance our
interoperability with those libraries is also a factor to be considered.
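The top-k evaluation described above can be sketched with existing NumPy calls. This is only an illustration under stated assumptions: `top_k_accuracy` is a hypothetical helper name, not a NumPy function and not the API proposed in this thread, and the scores and labels are made up.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true class is among the k highest scores.

    Hypothetical helper for illustration; not part of NumPy.
    scores: (n_samples, n_classes) array of predictor outputs.
    labels: (n_samples,) array of integer class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argpartition places the indices of the k largest scores in the last
    # k columns of each row (in no particular order) without a full sort.
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # A sample counts as a hit if its true label is one of those k classes.
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Made-up data: 3 samples, 4 classes.
scores = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.40, 0.10, 0.15, 0.35],
    [0.05, 0.20, 0.30, 0.45],
])
labels = np.array([2, 3, 0])

top1 = top_k_accuracy(scores, labels, k=1)  # 0.0: no true class ranks first
top2 = top_k_accuracy(scores, labels, k=2)  # 2/3: two true classes rank in the top 2
```

Note that the helper works with class indices rather than the score values, which matches the point that ML usage cares about which classes rank highest, not about the largest values as such.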
I think that such a simple function should be named in the most obvious
way possible, or it will become a function that is used in the
domains where the unusual name makes sense but ends up being
re-implemented in all other contexts. I am sure that if I had
been looking for a function that returns the N largest items in an array
(whether ranked according to a given key function or otherwise), I
would never have looked at a function named `topk()` or `top_k()`, and I
am pretty sure I would have discarded anything that has `k` or `top` in
its name.

On the other hand, I understand that ML is where all the hype (and a
large fraction of the money) is these days, so I understand if numpy
wants to appease the crowd.

Cheers,
Dan