[Numpy-discussion] Low-level API for Random

Sat Sep 21 00:30:41 EDT 2019

On Fri, Sep 20, 2019 at 11:33 PM Ralf Gommers <ralf.gommers at gmail.com>
wrote:

>
>
> On Fri, Sep 20, 2019 at 7:09 AM Robert Kern <robert.kern at gmail.com> wrote:
>
>>
>>
>> On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <ralf.gommers at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <robert.kern at gmail.com>
>>> wrote:
>>>
>>>>
>>>> We might end up with more than 2 implementations if we need to change
>>>> something about the function signature, for whatever reason, and we want to
>>>> retain C/Cython API compatibility with older code. The C functions aren't
>>>> necessarily going to be one-to-one to the Generator methods. They're just
>>>> part of the implementation. So for example, if we wanted to, say,
>>>> precompute some intermediate values from the given scalar parameters so we
>>>> don't have to recompute them for each element of the `size`-large requested
>>>> output, we might do that in one C function and pass those intermediate
>>>> values as arguments to the C function that does the actual sampling. So
>>>> we'd have two C functions for that one Generator method, and the sampling C
>>>> function will not have the same signature as it did before the modification
>>>> that refactored the work into two functions. In that case, I would not be
>>>> so strict as to require that `Generator.foo` is one to one with
>>>> `random_foo`.
>>>>
>>>
>>> You're saying "be so strict" as if it were a bad thing, or a major
>>> effort.
>>>
>>
>> I am. It's an unnecessary limitation on the C API without a corresponding
>> benefit. Your original complaint
>>
>
> It's not a "complaint".
>

Please forgive me. That word choice was not intended to be dismissive. I
don't view "complaints" as minor annoyances that the "complainer" should
just shut up and deal with, or that the "complainer" is just being
annoying, but I can see how it came across that I might. Please continue as
if I said "The problem you originally noted...". It's a real problem that
needs to be addressed. We just have different thoughts on exactly what is
needed to address it.

> We're having this discussion because we shipped a partial API in 1.17.0
> that we will now have to go back and either take out or clean up in 1.17.3.
> The PR for the new numpy.random grew so large that we didn't notice or
> discuss that (such things happen, no big deal - we have limited reviewer
> bandwidth). So now that we do, it makes sense to actually think about what
> needs to be in the API. For now I think that's only the parts that are
> matching the Python API plus what is needed to use them from C/Cython.
> Future additions require similar review and criteria as adding to the
> Python API and the existing NumPy C API. To me, your example seems to (a)
> not deal with API stability, and (b) expose too much implementation detail.
>
> To be clear about the actual status, we:
> - shipped one header file (bitgen.h)
> - shipped two pxd files (common.pxd, bit_generator.pxd)
> - removed a header file we used to ship (randomkit.h)
> - did not ship distributions.pxd, bounded_integers.pxd,
> legacy_distributions.pxd or related header files
>
> bit_generator.pxd looks fine, common.pxd contains parts that shouldn't be
> there. I think the intent was to ship at least distributions.pxd/h, and
> perhaps all of those pxd files.
>
> is much more directly addressed by a "don't gratuitously name related C
>> functions differently than the Python methods they implement" rule (e.g.
>> "gauss" instead of "normal").
>>
>>
>>> I understand that in some cases a C API can not be evolved in the same
>>> way as a Python API, but in the example you're giving here I'd say you want
>>> one function to be public, and one private. Making both public just exposes
>>> more implementation details for no good reason, and will give us more
>>> maintenance issues long-term.
>>>
>>
>> Not at all. In this example, neither one of those functions is useful
>> without the other. If one is public, both must be.
>>
>
> If neither one is useful without the other, it sounds like both should be
> private and the third one that puts them together - the one that didn't
> change signature and implements `Generator.foo` - is the public one.
>

That defeats the point of using the C API in this instance, though. The
reason it got split into two (in this plausible hypothetical; I'm thinking
of the binomial implementation here, which caches these intermediates in a
passed-in struct) is because in C you want to call them in different ways
(in this case, the prep function once and the sampling function many
times). It's not that you always call them in lockstep pairs: `prep();
sample(); prep(); sample();`. A C function that combines them defeats the
efficiency that one wanted to gain by using the C API. The C API has
different needs than the Python API, because the Python API has a lot more
support from the Python language and numpy data structures to be able to
jam a lot of functionality into a single function signature that C just
doesn't give us. The purpose of the C API is not just to avoid Python
function call overhead. If there's a reason that the Generator method needs
the implementation split up into multiple C functions, that's a really
strong signal that *other* C code using the C API will need that same
split. It's not just an implementation detail; it's a documented use case.

Given the prevalence of Cython, it's actually really easy to use the Python
API pretty easily in "C", so it's actually a huge waste if the C API
matches the Python API too closely. The power and utility of the C API will
be in how it *differs* from the Python API. For the distribution methods,
this is largely in how it lets you sample one number at a time without
bothering with the numpy and broadcasting overhead. That's the driving
motivation for having a C API for the distributions, and the algorithms
that we choose have consequences for the C API that will best satisfy that
motivation.

The issue that this raises with API stability is that if you require a
one-to-one match between the C API function and the Generator method, we
can never change the function signature of the C function. That's going to
forbid us from moving from an algorithm that doesn't need any
precomputation to one that does. That precomputation either requires a
2-function dance or a new argument to keep the cached values (c.f.
`random_binomial()`), so it's always going to affect the API. To use such a
new algorithm, we'll have to add a new function or two to the C API and
document the deprecation of the older API function. We can't just swap it
in under the same name, even if the new function is standalone. That's a
significant constraint on future development when the main issue that led
to the suggestion is that the names were sometimes gratuitously different
between the C API and the Python API, which hindered discoverability. We
can fix *that* problem easily without constraining the universe of
algorithms that we might consider using in the future.

-- 
Robert Kern
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190921/b4da863f/attachment-0001.html>