
On 30/05/2021 00:48, Robert Kern wrote:
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <daniele@grinta.net> wrote:
What does `k` stand for here? As someone who has never encountered this function before, I find both names equally confusing. If I understand what the function is supposed to be doing, I think `largest()` would be much more descriptive.
`k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function.
I thought that a `largest()` function with an integer second argument would be self-explanatory enough. `nlargest()` would be much more obvious to the wider audience, I think.
https://docs.python.org/3/library/heapq.html#heapq.nlargest
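For reference, a minimal sketch of how the stdlib function mentioned above behaves. The sample data and variable names are made up for illustration; only `heapq.nlargest()` itself comes from the standard library.

```python
import heapq

# Illustrative data, not taken from the thread.
values = [3, 9, 1, 7, 5]

# heapq.nlargest(n, iterable) returns the n largest items, largest first.
top3 = heapq.nlargest(3, values)

# With a key function, items are ranked by key(item),
# but the items themselves are returned.
words = ["kiwi", "banana", "fig", "cherries"]
longest2 = heapq.nlargest(2, words, key=len)

print(top3)      # [9, 7, 5]
print(longest2)  # ['cherries', 'banana']
```

The count comes first in the argument list, which matches the `n` in the function name.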
"top-k" comes from the ML community where this function is used to evaluate classification models (`k` instead of `n` being largely an accident of history, I imagine). In many classification problems, the number of classes is very large, and they are very related to each other. For example, ImageNet has a lot of different dog breeds broken out as separate classes. In order to get a more balanced view of the relative performance of the classification models, you often want to check whether the correct class is in the top 5 classes (or whatever `k` is appropriate) that the model predicted for the example, not just the one class that the model says is the most likely. "5 largest" doesn't really work in the sentences that one usually writes when talking about ML classifiers; they are talking about the 5 classes that are associated with the 5 largest values from the predictor, not the values themselves. So "top k" is what gets used in ML discussions, and that transfers over to the name of the function in ML libraries.
It is a top-down reflection of the higher level thing that people want to compute (in that context) rather than a bottom-up description of how the function is manipulating the input, if that makes sense. Either one is a valid way to name things. There is a lot to be said for numpy's domain-agnostic nature that we should prefer the bottom-up description style of naming. However, we are also in the midst of a diversifying ecosystem of array libraries, largely driven by the ML domain, and adopting some of that terminology when we try to enhance our interoperability with those libraries is also a factor to be considered.
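To make the ML usage described above concrete, here is a rough sketch of a top-k accuracy check in plain Python. The helper names `top_k_classes()` and `top_k_accuracy()` and the scores are hypothetical, chosen only for illustration; this is not the function under discussion, nor any particular library's API. Note that what gets compared is the class indices attached to the k largest scores, not the scores themselves.

```python
import heapq

def top_k_classes(scores, k):
    """Return the indices (classes) of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true class is among the k top-scoring classes."""
    hits = sum(
        label in top_k_classes(scores, k)
        for scores, label in zip(score_rows, true_labels)
    )
    return hits / len(true_labels)

# Hypothetical predictor scores for 3 examples over 4 classes.
score_rows = [
    [0.10, 0.60, 0.25, 0.05],  # true class 2 has the second-highest score
    [0.70, 0.10, 0.15, 0.05],  # true class 0 has the highest score
    [0.40, 0.30, 0.20, 0.10],  # true class 3 has the lowest score
]
true_labels = [2, 0, 3]

top1 = top_k_accuracy(score_rows, true_labels, k=1)  # 1 of 3 examples correct
top2 = top_k_accuracy(score_rows, true_labels, k=2)  # 2 of 3 examples correct
```

With `k=1` only the second example counts as correct; with `k=2` the first one does as well, which is the more forgiving view that the "top 5" metric gives on datasets with many closely related classes.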
I think that such a simple function should be named in the most obvious way possible, or it will become a function that is used in the domains where the unusual name makes sense but ends up being re-implemented in all other contexts. I am sure that if I had been looking for a function that returns the N largest items in an array (whether ranked according to a given key function or otherwise), I would never have looked at a function named `topk()` or `top_k()`, and I am pretty sure I would have discarded anything that has `k` or `top` in its name. On the other hand, I understand that ML is where all the hype (and a large fraction of the money) is these days, so I understand if numpy wants to appease the crowd.

Cheers,
Dan