[Numpy-discussion] Proposed Roadmap Overview

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Mon Feb 20 13:08:50 EST 2012


On 02/20/2012 09:34 AM, Christopher Jordan-Squire wrote:
> On Mon, Feb 20, 2012 at 9:18 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 02/20/2012 08:55 AM, Sturla Molden wrote:
>>> On 20.02.2012 17:42, Sturla Molden wrote:
>>>> There are still other options than C or C++ that are worth considering.
>>>> One would be to write NumPy in Python. E.g. we could use LLVM as a
>>>> JIT compiler and produce the performance-critical code we need on the fly.
>>>>
>>>>
>>>
>>> LLVM and its C/C++ frontend Clang are BSD-licensed. Clang compiles
>>> faster than GCC and often produces better machine code. They could
>>> therefore be used inside an array library. It would give a faster
>>> NumPy, and we could keep most of it in Python.
>>
>> I think it is moot to focus on improving NumPy's compute performance as
>> long as, in practice, NumPy operations are memory-bound: almost every
>> operation takes a trip through system memory. C/C++ is simply "good
>> enough". A JIT is for when you're chasing a 2x improvement or so, but
>> today NumPy can be 10-20x slower than a Cython loop.
>>
>
> I don't follow this. Could you expand a bit more? (Specifically, I
> wasn't aware that NumPy could be 10-20x slower than a Cython loop, if
> we're talking about the base NumPy library--so core operations. I'm

The problem with NumPy is all the temporaries it needs -- if you want to compute

A + B + np.sqrt(D)

then, if the arrays are larger than the cache size (a couple of
megabytes), each of those operations will transfer the data in and out
over the memory bus. I.e. first an element of sqrt(D) is computed, then
the result is written out to system memory, and later the same number is
read back in just so it can be added to an element of the A + B
temporary, and so on.
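
Spelled out, that one-liner really runs as three separate full-array
passes, each with its own trip over the memory bus (a minimal sketch;
the arrays are just sized to be much larger than cache):

import numpy as np

n = 10**7                 # comfortably larger than any CPU cache
A = np.random.rand(n)
B = np.random.rand(n)
D = np.random.rand(n)

# A + B + np.sqrt(D) is evaluated as:
tmp1 = A + B              # pass 1: read A and B, write a full-size temporary
tmp2 = np.sqrt(D)         # pass 2: read D, write another full-size temporary
result = tmp1 + tmp2      # pass 3: read both temporaries back, write result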

The compute-to-bandwidth ratio of modern CPUs is between 30:1 and 60:1
-- that is, the CPU can do roughly 30-60 floating-point operations in
the time it takes to move a single number to or from system memory. So
in extreme cases it's cheaper to do 60 additions than to transfer a
single number from system memory.

It is much faster to transfer only an element (or a small block) from
each of A, B, and D into CPU cache, evaluate the entire expression on
it, and then transfer the result back. This is easy to code in
Cython/Fortran/C and impossible to express as an ordinary NumPy/Python
expression.
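
To make that concrete, here is the blocked strategy forced by hand with
in-place ufuncs -- in practice you'd write the inner loop in
Cython/Fortran/C instead, and the function name and block size here are
just illustrative (this sketch reuses A, B, D from above):

def blocked_expression(A, B, D, block=65536):
    # Evaluate A + B + np.sqrt(D) one cache-sized block at a time, so
    # each input element crosses the memory bus only once. (Assumes
    # 1-D arrays.)
    out = np.empty_like(A)
    for i in range(0, A.size, block):
        s = slice(i, i + block)
        np.sqrt(D[s], out[s])    # sqrt of one block, straight into out
        out[s] += A[s]           # in-place adds: no full-size temporaries
        out[s] += B[s]
    return out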

This is why numexpr and Theano exist.
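
numexpr, for instance, compiles an expression string and evaluates it in
roughly this blocked fashion; a small usage sketch, again reusing the
arrays from above:

import numexpr as ne

# One blocked pass over A, B and D; no full-size temporaries.
fused = ne.evaluate("A + B + sqrt(D)")

assert np.allclose(fused, A + B + np.sqrt(D))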

You can make the slowdown relative to Cython/Fortran/C almost
arbitrarily large by adding terms to the expression above. So of course,
the actual slowdown depends on your use case.

> also not totally sure why a JIT is a 2x improvement or so vs. Cython.
> Not that I disagree on either of these points, I'd just like a bit
> more detail.)

I meant that a JIT might be a 2x improvement over the current NumPy C
code. There's some logic in the array iteration that could perhaps be
specialized away at runtime for the actual array layout.

But I'm thinking that a JIT probably wouldn't help all that much, so
it's probably more like 1x -- the 2x was just to be very conservative
w.r.t. the argument I was making, as I don't know the NumPy C sources
well enough.

Dag


