[Numpy-discussion] Broadcasting rules (Ticket 76).

Sasha ndarray at mac.com
Wed Apr 26 14:18:06 EDT 2006


I would like to apologize up-front if anyone found my overly general
arguments inappropriate.  I did not intend to be critical of
anyone's code or design other than my own.  Any references to "bad
design" or "broken code" relate to my own misguided attempts to
use some of the Numeric features in the past.  It turned out that
dimension-preserving broadcasting was the wrong feature to use for a
specific class of problems that I deal with most of the time.
This does not mean that it cannot be used appropriately in other
domains.

I was wrong in posting overly general opinions without providing
specific examples.  I will try to do better in this post.  Before I do
that, however, let me try to explain why I hold strong views on
certain things.  In my view the most appealing feature of Python is
the Zen of Python <http://www.python.org/doc/Humor.html#zen>, and in
particular "There should be one-- and preferably only one --obvious
way to do it."  In my view Python represents the "hard science"
approach, appealing to physics and math types, while Perl is more of a
"soft science" language.  (There is nothing wrong with either Perl or
soft sciences.)  This is what makes Python so appealing for scientific
computing.  Unfortunately, it is a fact of life that there are
always many ways to solve the same problem, and a successful "pythonic"
design has to pick one (preferably the best) of the possible ways and
make it obvious.

This said, let me present a specific problem that I will use to
illustrate my points below.  Suppose we study school statistics in
different cities.  Let city A have 10 schools with 20 classes and 30
students in each.  It is natural to organize the data collected about
the students in a 10x20x30 array.  It is also natural to collect some
of the data at the per-school or per-class level.  This data may come
from aggregating student-level statistics (say, an average test score) or
from characteristics that are class- or school-specific (say, the
grade or primary language).  There are two obvious ways to represent
such data: 1) use 3-d arrays for everything, making the shape
of the per-class array 10x20x1 and the shape of the per-school array
10x1x1; or 2) use 2-d and 1-d arrays.  The first approach seems
more flexible.  We could also have 10x1x30 or 1x1x30 arrays to represent
data which varies along the student dimension but is constant across
schools or classes.  However, this added benefit is illusory: the
first student in one class list has no relationship to the first
student in another class, so in this particular problem an average
score of the first student across classes makes no sense (it would also
depend on whether students are ordered alphabetically or by
achievement rank).
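
To make the two representations concrete, here is a small NumPy sketch
(the array names and the choice of averages are my own illustration):

    import numpy as np

    scores = np.random.rand(10, 20, 30)   # per-student data

    # Approach 1: keep everything 3-d, using trailing size-1 axes
    class_avg_3d  = scores.mean(axis=2)[:, :, np.newaxis]   # (10, 20, 1)
    school_avg_3d = scores.mean(axis=2).mean(axis=1)[:, np.newaxis,
                                                     np.newaxis]  # (10, 1, 1)

    # Approach 2: use lower-dimensional arrays
    class_avg_2d  = scores.mean(axis=2)                # (10, 20)
    school_avg_1d = scores.mean(axis=2).mean(axis=1)   # (10,)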

On the other hand, this approach has a very significant drawback:
functions that process city data have no way to distinguish between
per-school data and a lucky city that can afford to educate its
students in individual classes.  Just as it is extremely unlikely to
have one student per class in our toy example, in real-world problems
it often seems reasonable to assume that a dimension of size 1 represents
aggregate data.  Software designed around this assumption is what
I would call broken in a subtle way.
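
For instance, a consumer of the 3-d representation might be tempted to
write something like this (a deliberately broken sketch):

    def is_aggregate(a):
        # WRONG: a genuine city with one student per class also has
        # shape (10, 20, 1), so it is misclassified as aggregate data.
        return a.shape[2] == 1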

Please see more below.


On 4/26/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> Sasha wrote:
> > On 4/25/06, tim.hochberg at cox.net <tim.hochberg at cox.net> wrote:
> >
> >> ---- Travis Oliphant <oliphant.travis at ieee.org> wrote:
> [...]
> I don't think anyone is fundamentally opposed to multiple repetitions.
> We're just being cautious.   Also, as you've noted, the assignment code
> is currently not using the ufunc broadcasting code and so they really
> aren't the same thing, yet.

It looks like there is a lot of development in this area going on at
the moment.  Please let me know if I can help.

> [...]
> > In my experience broadcasting length-1 and not broadcasting other
> > lengths is very error prone as it is.
>
> That's not been my experience.

I should have been more specific.  As I explained above, the special
properties of length-1 led me to design a system that distinguished
aggregate data by testing for unit length.  This system was subtly
broken: in the rare case when the population had only one element, the
system produced wrong results.

> But, I don't know R very well.  I'm very
> interested in what ideas you can bring.
>

R takes a very simple approach: everything is a vector.  There are no
scalars; if you need a scalar, you use a vector of length 1.
Broadcasting is simply repetition:

> x <- rep(0,10)
> x + c(1,2)
 [1] 1 2 1 2 1 2 1 2 1 2

The length of the longer vector does not even need to be a multiple of
the length of the shorter one, but in that case a warning is issued:

> x + c(1,2,3)
 [1] 1 2 3 1 2 3 1 2 3 1
Warning message:
longer object length
        is not a multiple of shorter object length in: x + c(1, 2, 3)
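
For comparison, NumPy repeats only length-1 dimensions; the analogous
operation with a length-2 operand is an error (a sketch of current
numpy behavior):

    import numpy as np

    x = np.zeros(10)
    # x + np.array([1, 2])   # raises ValueError: shape mismatch
    x + np.array([2.0])      # fine: the length-1 operand is repeated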

Multi-dimensional arrays are implemented by setting a "dim" attribute:

> dim(x) <- c(2,5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0

(R uses Fortran order).  Broadcasting ignores the dim attribute, but
does the right thing for conformable vectors:

> x + c(1,2)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    2    2    2    2    2

However, the following is unfortunate:
> x + 1:5
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    2    4
[2,]    2    4    1    3    5
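
For comparison, NumPy's dimension-increasing rule aligns a 1-d operand
with the last axis, which arguably does the right thing in this case:

    import numpy as np

    x = np.zeros((2, 5))
    x + np.arange(1, 6)   # the 1-d operand is treated as shape (1, 5)
    # each row of the result is [1, 2, 3, 4, 5]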


> >  I understand that restricting
> > broadcasting to make it a strictly dimension-increasing operation is
> > not possible for two reasons:
> >
> > 1. Numpy cannot break legacy Numeric code.
> > 2. It is not possible to differentiate between 1-d array that
> > broadcasts column-wise vs. one that broadcasts row-wise.
> >
> > In my view none of these reasons is valid.  In my experience Numeric
> > code that relies on dimension-preserving broadcasting is already
> > broken, only in a subtle and hard to reproduce way.
>
> I definitely don't agree with you here.  Dimension-preserving
> broadcasting is at the heart of the utility of broadcasting and it is
> very, very useful for that.  Calling it subtly broken suggests that you
> don't understand it and have never used it for its intended purpose.
> I've used dimension-preserving broadcasting literally hundreds of
> times.  It's rather bold of you to say that all of that code is "broken".
>
Sorry I was not specific in the original post.  I hope you now
understand where I come from.  Can you point me to some examples of
the correct way to use dimension-preserving broadcasting?  I would
assume that it is probably more useful in problem domains where
there is no natural ordering of the dimensions, unlike in the
hierarchical data example that I used.

> Now, I'm sure there are other useful ways to "broadcast",  but
> dimension-preserving is essentially what broadcasting *is* in NumPy.
> If anything it is the dimension-increasing rule that is somewhat
> arbitrary (e.g. why prepend with ones).
>
Dimension-increasing broadcasting is very natural when you deal
with hierarchical data where the dimensions correspond to
levels of aggregation.  As I explained above, the average student score
per class makes sense while the average score per student over classes
does not.  It is very common to combine per-class data with
per-student data by broadcasting the per-class data.  For example, the
total time spent by a student is the time spent in regular per-class
sessions plus the time spent in individually elected courses.
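
In NumPy this combination is expressed by explicitly appending a
student axis to the per-class data before broadcasting; a small sketch
(the array names are my own):

    import numpy as np

    class_time    = np.random.rand(10, 20)      # regular hours per class
    elective_time = np.random.rand(10, 20, 30)  # elective hours per student

    # append a student axis, then broadcast across the 30 students
    total_time = class_time[:, :, np.newaxis] + elective_time  # (10, 20, 30)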

>
> Perhaps you want to introduce some other way for non-commensurate shapes
> to interact in an operation.   I think you will find many open minds on
> this list (although probably not anyone who will want to code it up :-)
> ).     We do welcome the discussion.    Your experience with other
> array-like languages is helpful.
>
I will be happy to contribute code if I see interest.

>
> >   Similarly the
> > need to broadcast over non-leading dimension is a sign of bad design.
> > In rare cases where such broadcasting is desirable, it can be easily
> > done via swapaxes which is a cheap operation.
> >
>
> Again, it would help if you would refrain from using negative words
> about coding styles that are different from your own.     Such
> broadcasting is not that rare.  It happens quite frequently, actually.
> The point of a language like Python is that you can write algorithms
> simply without struggling with optimization questions up front like you
> seem to be hinting at.
>
I hope you understand that I did not mean to criticize anyone's coding
style.  I was not really hinting at optimization issues; I just had a
particular design problem in mind (see above).  Incidentally,
dimension-increasing broadcasting does tend to lead to more efficient
code, both in terms of memory utilization and in yielding more
straightforward algorithms with fewer special cases, but this was not
really what I was referring to.


> > On the other hand I don't see much problem in making
> > dimension-preserving broadcasting more permissive.  In R, for example,
> > (1-d) arrays can be broadcast to arbitrary size.  This has an
> > additional benefit that 1-d to 2-d broadcasting requires no special
> > code, it just happens because matrices inherit arithmetics from
> > vectors.  I've never had a problem with R rules being too loose.
> >
>
> So, please explain exactly what you mean.   Only a few on this list know
> what the R rules even are.

See above.

> > In my view the problem that your ticket highlighted is not so much in
> > the particular set of broadcasting rules, but in the fact that a[...]
> > = b uses one set of rules while a[...] += b uses another.  This is
> > *very* confusing.
> >
>
> Yes, this is admittedly confusing.  But, it's an outgrowth of the way
> Numeric code developed.  Broadcasting was always only a ufunc concept in
> Numeric, and copying was not a ufunc.    NumPy grew out of Numeric
> code.   I was not trying to mimic broadcasting behavior when I wrote
> the array copy and array setting code.  Perhaps I should have been.
>
In the spirit of appealing to obscure languages ;-), let me mention
that in the K language (kx.com) element assignment is implemented
using an Amend primitive that takes four arguments: @[x,i,f,y] is more
or less equivalent to numpy's x[i] = f(x[i], y[i]), where x, y and i
are vectors and f is a binary (broadcasting) function.  Thus, x[i] +=
y[i] can be written as @[x,i,+,y] and x[i] = y[i] is @[x,i,:,y], where
':' denotes a binary function that returns its second argument and
ignores the first.  The K interpreter's Linux binary is less than 200K,
and that includes a simple X window GUI!  Such a small code size would
not be possible without picking the right set of primitives and avoiding
special-case code.
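
A rough Python/NumPy sketch of the idea (the name amend and its exact
semantics are my own approximation of the K primitive):

    import numpy as np

    def amend(x, i, f, y):
        # analogue of K's @[x,i,f,y]: x[i] = f(x[i], y)
        x[i] = f(x[i], y)

    x = np.zeros(5)
    amend(x, [0, 2], np.add, [1.0, 2.0])         # like x[i] += y
    amend(x, [1], lambda old, new: new, [9.0])   # like x[i] = y (K's ':')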


> I'm willing to change the code on this one, but only if the new copy
> code actually does implement broadcasting behavior equivalently.  And
> going through the ufunc machinery is probably a waste of effort because
> the copy code must be written for variable length arrays anyway (and
> ufuncs don't support them).
>
I know close to nothing about variable length arrays.  When I need to
deal with relational database data, I transpose it so that each
column gets into its own fixed-length array.  This is the approach
that both R and K take.  However, at least at the C level, I don't see
why the ufunc code cannot be generalized to handle variable length
arrays.  At the Python level, pre-defined arithmetic or math functions
are probably not feasible for variable length data, but the ability to
define a variable length array function by just writing an inner loop
implementation may be quite useful.
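
As a rough illustration, one can already get some of this effect at
the Python level with object arrays and np.frompyfunc, supplying only
the element-wise operation (a sketch; it bypasses the C inner-loop
machinery entirely):

    import numpy as np

    # element-wise "inner loop" over variable-length elements
    concat = np.frompyfunc(lambda a, b: a + b, 2, 1)

    x = np.array([[1, 2], [3]], dtype=object)   # ragged rows
    y = np.array([[0], [4, 5]], dtype=object)
    concat(x, y)   # object array holding [1, 2, 0] and [3, 4, 5]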

> However, the broadcasting machinery has been abstracted in NumPy and can
> therefore be re-used in the copying code.  In Numeric, broadcasting was
> basically implemented deep inside a confusing while loop.

I've never understood Numeric's while loop and completely agree
with your characterization.  I am still studying the numpy code, but
it is clearly a big improvement.



