[Numpy-discussion] NEP for faster ufuncs

Mark Wiebe mwwiebe at gmail.com
Wed Dec 22 14:42:54 EST 2010

On Wed, Dec 22, 2010 at 11:16 AM, Francesc Alted <faltet at pytables.org>wrote:

> On Wednesday 22 December 2010 19:52:45 Mark Wiebe wrote:
> > On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted
> <faltet at pytables.org>wrote:
> > > NumPy version 2.0.0.dev-147f817
> >
> > There's your problem, it looks like the PYTHONPATH isn't seeing your
> > new build for some reason.  That build is off of this commit in the
> > NumPy master branch:
> >
> > https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
> Uh, I think I'm a bit lost here.  I've cloned this repo:
> $ git clone git://github.com/m-paradox/numpy.git
> Is that wrong?

That's right, it was my mistake to assume that the page for a branch on
github would give you that branch.  You need the 'new_iterator' branch, so
after that clone, you should do this:

$ git checkout origin/new_iterator
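
After checking out the branch, a quick sanity check like the following (an illustrative sketch, not part of the original instructions) can confirm that Python is importing the branch build rather than a previously installed NumPy, which was the PYTHONPATH issue above:

```python
import numpy

# The version string should match the branch build (e.g. a dev version),
# and __file__ should point into the build/source tree, not an old
# site-packages install.
print(numpy.__version__)
print(numpy.__file__)
```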

> > > Ah, okay.  However, Numexpr is not meant to accelerate calculations
> > > with small operands.  I suppose that this is where your new
> > > iterator makes more sense: accelerating operations where some of
> > > the operands are small (i.e. fit in cache) and have to be
> > > broadcasted to match the dimensionality of the others.
> >
> > It's not about small operands, but small chunks of the operands at a
> > time, with temporary arrays for intermediate calculations.  It's the
> > small chunks + temporaries which must fit in cache to get the
> > benefit, not the whole array.
> But you need to transport those small chunks from main memory to cache
> before you can start doing the computation for this piece, right?  This
> is what I'm saying that the bottleneck for evaluating arbitrary
> expressions (like "3*a+b-(a/c)", i.e. not including transcendental
> functions, nor broadcasting) is memory bandwidth (and more in particular
> RAM bandwidth).

In the example expression, I believe the evaluation would go something like
this.  Assuming the memory allocator keeps giving back the same locations to
'luf', all temporary variables will already be in cache after the first pass:

temp1 = 3 * a           # a is read from main memory
temp2 = temp1 + b       # b is read from main memory
temp3 = a / c           # a is already in cache, c is read from main memory
result = temp2 - temp3  # result is written back to main memory

So there are 4 reads and writes to chunks from outside of the cache, but 12
total reads and writes to chunks, so speeding up the parts already in cache
would appear to be beneficial.  The benefit will get better with more
complicated expressions.  I think as long as the operation is slower than a
memcpy, the RAM bandwidth isn't the main bottleneck to be concerned with,
but instead produces an upper bound on performance.  I'm not sure how to
precisely measure that overhead, though.
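The blocking idea above can be sketched in plain NumPy (an illustrative sketch only; numexpr and the new iterator do this internally with compiled inner loops, and the block size here is an arbitrary assumption):

```python
import numpy as np

def eval_chunked(a, b, c, blocksize=4096):
    """Evaluate 3*a + b - a/c one block at a time, so the
    temporaries are blocksize elements and can stay in cache,
    instead of full-length arrays written back to main memory."""
    out = np.empty_like(a)
    for start in range(0, a.size, blocksize):
        s = slice(start, start + blocksize)
        # temp1, temp2, temp3 for this block are all cache-sized
        out[s] = 3*a[s] + b[s] - a[s]/c[s]
    return out

a = np.random.rand(100000)
b = np.random.rand(100000)
c = np.random.rand(100000) + 1.0   # keep c away from zero

chunked = eval_chunked(a, b, c)
direct = 3*a + b - a/c             # whole-array temporaries
print(np.allclose(chunked, direct))  # → True
```

The two evaluations compute the same result; the difference is that the chunked version's intermediate reads and writes hit cache-resident blocks rather than full-size temporary arrays.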

> > The numexpr front page explains this
> > fairly well in the section "Why It Works":
> >
> > http://code.google.com/p/numexpr/#Why_It_Works
> I know.  I wrote that part (based on the notes by David Cooke, the
> original author ;-)

Cool :)

