[Numpy-discussion] Proposed Roadmap Overview

Mon Feb 20 13:18:40 EST 2012

On Feb 20, 2012, at 7:08 PM, Dag Sverre Seljebotn wrote:

> On 02/20/2012 09:34 AM, Christopher Jordan-Squire wrote:
>> On Mon, Feb 20, 2012 at 9:18 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no>  wrote:
>>> On 02/20/2012 08:55 AM, Sturla Molden wrote:
>>>> Den 20.02.2012 17:42, skrev Sturla Molden:
>>>>> There are still other options than C or C++ that are worth considering.
>>>>> One would be to write NumPy in Python. E.g. we could use LLVM as a
>>>>> JIT-compiler and produce the performance critical code we need on the fly.
>>>>> 
>>>>> 
>>>> 
>>>> LLVM and its C/C++ frontend Clang are BSD licenced. It compiles faster
>>>> than GCC and often produces better machine code. They can therefore be
>>>> used inside an array library. It would give a faster NumPy, and we could
>>>> keep most of it in Python.
>>> 
>>> I think it is moot to focus on improving NumPy performance as long as in
>>> practice all NumPy operations are memory bound due to the need to take a
>>> trip through system memory for almost any operation. C/C++ is simply
>>> "good enough". JIT is when you're chasing a 2x improvement or so, but
>>> today NumPy can be 10-20x slower than a Cython loop.
>>> 
>> 
>> I don't follow this. Could you expand a bit more? (Specifically, I
>> wasn't aware that numpy could be 10-20x slower than a cython loop, if
>> we're talking about the base numpy library--so core operations. I'm
> 
> The problem with NumPy is the temporaries needed -- if you want to compute
> 
> A + B + np.sqrt(D)
> 
> then, if the arrays are larger than cache size (a couple of megabytes),
> each of those operations will first transfer the data in and out
> over the memory bus. I.e. first you compute an element of sqrt(D), then 
> the result of that is put in system memory, then later the same number 
> is read back in order to add it to an element in B, and so on.
> 
> The compute-to-bandwidth ratio of modern CPUs is between 30:1 and 
> 60:1... so in extreme cases it's cheaper to do 60 additions than to 
> transfer a single number from system memory.
> 
> It is much faster to only transfer an element (or small block) from each 
> of A, B, and D to CPU cache, then do the entire expression, then 
> transfer the result back. This is easy to code in Cython/Fortran/C and 
> impossible with NumPy/Python.
> 
> This is why numexpr/Theano exist.

Well, I can't speak for Theano (it is considerably more general than numexpr, and geared more towards GPUs, right?), but this was certainly the issue that led David Cooke to create numexpr.  A more in-depth explanation of this problem can be found at:

http://www.euroscipy.org/talk/1657

which includes some graphical explanations.
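
To make the temporaries issue concrete, here is a minimal sketch in plain NumPy. It is only an illustration of the blocking idea, not how numexpr is actually implemented; the function names and the block size of 4096 elements are assumptions of mine, and it assumes 1-D float arrays of the same length and dtype.

```python
import numpy as np

def naive(a, b, d):
    # Each operation produces a full-size temporary array, so large
    # operands make several round trips through system memory.
    return a + b + np.sqrt(d)

def blocked(a, b, d, block=4096):
    # Hypothetical helper: evaluate the whole expression one block at a
    # time, so each block of a, b and d can stay in cache until done.
    out = np.empty_like(a)
    tmp = np.empty(block, dtype=a.dtype)  # small reusable scratch buffer
    for start in range(0, a.size, block):
        stop = min(start + block, a.size)
        o = out[start:stop]               # view into the result array
        t = tmp[:stop - start]
        np.add(a[start:stop], b[start:stop], out=o)  # o = a + b
        np.sqrt(d[start:stop], out=t)                # t = sqrt(d)
        np.add(o, t, out=o)                          # o = (a + b) + sqrt(d)
    return out
```

Both functions should return the same values; whether the blocked version is actually faster depends on the array sizes, the block size and the machine, and the Python-level loop adds its own overhead. With numexpr installed, the corresponding call would be numexpr.evaluate("a + b + sqrt(d)"), which does the blocking in compiled code.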

-- Francesc Alted