On Fri, Sep 20, 2019 at 11:33 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:

On Fri, Sep 20, 2019 at 7:09 AM Robert Kern <robert.kern@gmail.com> wrote:

On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:

On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <robert.kern@gmail.com> wrote:

We might end up with more than 2 implementations if we need to change something about the function signature, for whatever reason, and we want to retain C/Cython API compatibility with older code. The C functions aren't necessarily going to be one-to-one to the Generator methods. They're just part of the implementation. So for example, if we wanted to, say, precompute some intermediate values from the given scalar parameters so we don't have to recompute them for each element of the `size`-large requested output, we might do that in one C function and pass those intermediate values as arguments to the C function that does the actual sampling. So we'd have two C functions for that one Generator method, and the sampling C function will not have the same signature as it did before the modification that refactored the work into two functions. In that case, I would not be so strict as to require that `Generator.foo` is one to one with `random_foo`.

You're saying "be so strict" as if it were a bad thing, or a major effort.

I am. It's an unnecessary limitation on the C API without a corresponding benefit. Your original complaint

It's not a "complaint".

Please forgive me. That word choice was not intended to be dismissive. I don't view "complaints" as minor annoyances that the "complainer" should just shut up and deal with, or that the "complainer" is just being annoying, but I can see how it came across that I might. Please continue as if I said "The problem you originally noted...". It's a real problem that needs to be addressed. We just have different thoughts on exactly what is needed to address it.

We're having this discussion because we shipped a partial API in 1.17.0 that we will now have to go back and either take out or clean up in 1.17.3. The PR for the new numpy.random grew so large that we didn't notice or discuss that (such things happen, no big deal - we have limited reviewer bandwidth). So now that we do, it makes sense to actually think about what needs to be in the API. For now I think that's only the parts that are matching the Python API plus what is needed to use them from C/Cython. Future additions require similar review and criteria as adding to the Python API and the existing NumPy C API. To me, your example seems to (a) not deal with API stability, and (b) expose too much implementation detail.

To be clear about the actual status, we:
- shipped one header file (bitgen.h)
- shipped two pxd files (common.pxd, bit_generator.pxd)
- removed a header file we used to ship (randomkit.h)
- did not ship distributions.pxd, bounded_integers.pxd, legacy_distributions.pxd or related header files

bit_generator.pxd looks fine, common.pxd contains parts that shouldn't be there. I think the intent was to ship at least distributions.pxd/h, and perhaps all of those pxd files.

is much more directly addressed by a "don't gratuitously name related C functions differently than the Python methods they implement" rule (e.g. "gauss" instead of "normal").

I understand that in some cases a C API can not be evolved in the same way as a Python API, but in the example you're giving here I'd say you want one function to be public, and one private. Making both public just exposes more implementation details for no good reason, and will give us more maintenance issues long-term.

Not at all. In this example, neither one of those functions is useful without the other. If one is public, both must be.

If neither one is useful without the other, it sounds like both should be private and the third one that puts them together - the one that didn't change signature and implements `Generator.foo` - is the public one.

That defeats the point of using the C API in this instance, though. The reason it got split into two (in this plausible hypothetical; I'm thinking of the binomial implementation here, which caches these intermediates in a passed-in struct) is because in C you want to call them in different ways (in this case, the prep function once and the sampling function many times). It's not that you always call them in lockstep pairs: `prep(); sample(); prep(); sample();`. A C function that combines them defeats the efficiency that one wanted to gain by using the C API. The C API has different needs than the Python API, because the Python API has a lot more support from the Python language and numpy data structures to be able to jam a lot of functionality into a single function signature that C just doesn't give us. The purpose of the C API is not just to avoid Python function call overhead. If there's a reason that the Generator method needs the implementation split up into multiple C functions, that's a really strong signal that *other* C code using the C API will need that same split. It's not just an implementation detail; it's a documented use case.

Given the prevalence of Cython, it's actually really easy to use the Python API pretty easily in "C", so it's actually a huge waste if the C API matches the Python API too closely. The power and utility of the C API will be in how it *differs* from the Python API. For the distribution methods, this is largely in how it lets you sample one number at a time without bothering with the numpy and broadcasting overhead. That's the driving motivation for having a C API for the distributions, and the algorithms that we choose have consequences for the C API that will best satisfy that motivation.

The issue that this raises with API stability is that if you require a one-to-one match between the C API function and the Generator method, we can never change the function signature of the C function. That's going to forbid us from moving from an algorithm that doesn't need any precomputation to one that does. That precomputation either requires a 2-function dance or a new argument to keep the cached values (c.f. `random_binomial()`), so it's always going to affect the API. To use such a new algorithm, we'll have to add a new function or two to the C API and document the deprecation of the older API function. We can't just swap it in under the same name, even if the new function is standalone. That's a significant constraint on future development when the main issue that led to the suggestion is that the names were sometimes gratuitously different between the C API and the Python API, which hindered discoverability. We can fix *that* problem easily without constraining the universe of algorithms that we might consider using in the future.

Robert Kern