[Numpy-discussion] Fwd: Allowing broadcasting of code dimensions in generalized ufuncs
Nathaniel Smith
njs at pobox.com
Tue Jul 3 04:27:56 EDT 2018
On Sat, Jun 30, 2018 at 6:51 AM, Marten van Kerkwijk
<m.h.vankerkwijk at gmail.com> wrote:
> Hi All,
>
> In case it was missed because people have tuned out of the thread: Matti and
> I proposed last Tuesday to accept NEP 20 (on coming Tuesday, as per NEP 0),
> which introduces notation for generalized ufuncs allowing fixed, flexible
> and broadcastable core dimensions. For one thing, this will allow Matti to
> finish his work on making matmul a gufunc.
>
> See http://www.numpy.org/neps/nep-0020-gufunc-signature-enhancement.html
So I still have some of the same concerns as before...
For the possibly missing dimensions: matmul is really important, and
making it a gufunc solves the problem of making it overridable by duck
arrays (via __array_ufunc__). Also, it will help later when we rework
dtypes: new dtypes will be able to implement matmul by the normal
ufunc loop registration mechanism, which is much nicer than the
current system where every dtype has a special-case method just for
handling matmul. The ? proposal isn't the most elegant idea ever, but
we've been tossing around ideas for solving these problems for a
while, and so far this seems to be the least-bad one, so... sure,
let's do it.
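To illustrate the override point, here is a minimal sketch. It assumes a NumPy release in which matmul is already a gufunc (1.16 or later, which postdates this message). LoggedArray is a hypothetical duck array invented for the example, not anything from NumPy or the NEP.

```python
import numpy as np

class LoggedArray:
    """Hypothetical duck array that records which ufuncs are applied to it."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.calls = []

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Record the ufunc name, unwrap LoggedArray inputs, defer to NumPy.
        self.calls.append(ufunc.__name__)
        raw = [x.data if isinstance(x, LoggedArray) else x for x in inputs]
        return LoggedArray(getattr(ufunc, method)(*raw, **kwargs))

a = LoggedArray(np.eye(2))
b = np.arange(4.0).reshape(2, 2)

# Because matmul is a (g)ufunc here, this call is routed through
# LoggedArray.__array_ufunc__ instead of coercing `a` to an ndarray.
result = np.matmul(a, b)
print(a.calls, type(result).__name__)
```

With the pre-gufunc matmul there was no such hook, which is the overridability problem described above.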
For the fixed-size dimensions: this makes me nervous. It's aimed at a
real use case, which is a major point in its favor. But a few things
make me wary. For input dimensions, it's sugar – the gufunc loop can
already raise an error if it doesn't like the size it gets. For output
dimensions, it does solve a real problem. But... only part of it. It's
awkward that right now you only have a few limited ways to choose
output dimensions, but this just extends the list of special cases,
rather than solving the underlying problem. For example,
'np.linalg.qr' needs a much more generic mechanism to choose output
shape, and parametrized dtypes will need a much more generic mechanism
to choose output dtype, so we're definitely going to end up with some
phase where arbitrary code gets to describe the output array. Are we
going to look back on fixed-size dimensions as a quirky, redundant
thing?
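The "it's sugar" point for input dimensions can be sketched as follows. This uses np.vectorize with a signature as a Python-level stand-in for a real C gufunc loop, and the names _cross3 and cross3 are hypothetical. The signature only says "(n)"; the core function itself rejects any size other than 3.

```python
import numpy as np

def _cross3(a, b):
    # The core function rejects sizes it does not support, so the
    # signature does not have to encode the fixed size 3.
    if a.shape != (3,) or b.shape != (3,):
        raise ValueError("core dimension must have size 3")
    return np.array([
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ])

# Hypothetical gufunc-like wrapper with a plain "(n)" core dimension.
cross3 = np.vectorize(_cross3, signature="(n),(n)->(n)")

x = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y = np.array([0.0, 1.0, 0.0])
out = cross3(x, y)  # loops over the leading axis of x, broadcasting y
print(out.shape)
```

This only covers the input side. It does nothing for output dimensions whose size cannot be inferred from the inputs, which is where a fixed size in the signature does solve a real problem.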
Also, as currently proposed, it seems to rule out the possibility of
using name-based axis specification in the future, right? (See
https://github.com/numpy/numpy/pull/8819#issuecomment-366329325) Are
we sure we want to do that?
If everyone else is comfortable with all these things then I won't
block it though.
For broadcasting: I'm sorry, but I think I'm -1 on this. I feel like
it falls into a classic anti-pattern in numpy, where someone sees a
cool thing they could do and then goes looking for problems to justify
it. (A red flag for me is that "it's easy to implement" keeps being
mentioned as justification for doing it.) The all_equal and
weighted_mean examples both feel pretty artificial -- traditionally
we've always implemented these kinds of functions as regular functions
that use (g)ufuncs internally, and it's worked fine (cf. np.allclose,
ndarray.mean). In fact in some sense the whole point of numpy is to
help people implement functions like this, without having to write
their own gufuncs. Is there some reason these need to be gufuncs? And
if there is, are these the only things that need to be gufuncs, or is
there a broader class we're missing? The design just doesn't feel
well-justified to me.
And in the past, when we've implemented things like this, where the
use cases are thin but "hey, why not, it's easy to do", it's ended up
causing two problems: first people start trying to force it into cases
where it doesn't quite work, which makes everyone unhappy... and then
when we eventually do try to solve the problem properly, we end up
having to do elaborate workarounds to keep the old not-quite-working
use cases from breaking.
I'm pretty sure we're going to end up rewriting most of the ufunc code
over the next few years as we ramp up duck array and user dtype
support, and it's already going to be very difficult, both to design
in the first place and then to implement while carefully keeping shims
to keep all the old stuff working. Adding features has a very real
cost, because it adds extra constraints that all this future work will
have to work around. I don't think this meets the bar.
By the way, I also think we're getting well past the point where we
should be switching from a string-based DSL to a more structured
representation. (This is another trap that numpy tends to fall into...
the dtype "language" is also a major offender.) This isn't really a
commentary on any part of this in particular, but just something that
I've been noticing and wanted to mention :-).
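For a sense of what "a more structured representation" might look like, here is a purely hypothetical sketch. CoreDim and parse_operand are invented for illustration and are not part of NumPy or of any proposal in this thread. The sketch handles only plain names, fixed integer sizes and the "?" modifier.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class CoreDim:
    """Hypothetical structured description of one core dimension."""
    name: Optional[str] = None  # symbolic name such as "n"
    size: Optional[int] = None  # fixed size such as 3
    optional: bool = False      # True when the string form carries "?"

def parse_operand(spec: str) -> Tuple[CoreDim, ...]:
    # Turn one operand of a signature string, e.g. "(n?,k)", into objects.
    inner = spec.strip()[1:-1]
    dims = []
    for token in filter(None, (t.strip() for t in inner.split(","))):
        optional = token.endswith("?")
        core = token.rstrip("?")
        if core.isdigit():
            dims.append(CoreDim(size=int(core), optional=optional))
        else:
            dims.append(CoreDim(name=core, optional=optional))
    return tuple(dims)

print(parse_operand("(n?,k)"))
```

The point of such a form is that new modifiers become new fields rather than new punctuation that every consumer of the string has to re-parse.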
-n
--
Nathaniel J. Smith -- https://vorpus.org