On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma@gmail.com> wrote:
>> Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.
> So, those two mallocs have been mentioned in the project introduction. I got that wrong.

On further thought/reading the code, it appears to be more complicated than that, actually.

It looks like (for a non-scalar array) we have two calls to PyMem_Malloc: one for the array object itself, and one for the shapes + strides. And one call to regular-old malloc: for the data buffer.

(Mysteriously, shapes + strides together have 2*ndim elements, but to hold them we allocate a memory region sized to hold 3*ndim elements. I'm not sure why.)

And contrary to what I said earlier, this is about as optimized as it can be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc, because the shapes+strides may need to be resized without affecting the much larger data area. But it's tempting to allocate the array object and the data buffer in a single memory region, like I suggested earlier. And this would ALMOST work. But, it turns out there is code out there which assumes (whether wisely or not) that you can swap around which data buffer a given PyArrayObject refers to (hi Theano!). And supporting this means that data buffers and PyArrayObjects need to be in separate memory regions.
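To make the tempting-but-broken idea concrete, here's a toy sketch of the single-allocation trick (the struct and names are all invented for illustration; numpy's real PyArrayObject and allocators are more involved):

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for the single-allocation idea: the "object" header
 * and its data buffer live in one malloc'd block, and the buffer is
 * found by pointer arithmetic.  One free() releases both -- which is
 * exactly what breaks if someone later swaps in a different buffer. */
typedef struct {
    size_t nbytes;
    double *data;
} toy_array;

static toy_array *toy_array_new(size_t n)
{
    /* single allocation: header followed immediately by n doubles */
    toy_array *a = malloc(sizeof(toy_array) + n * sizeof(double));
    if (a == NULL)
        return NULL;
    a->nbytes = n * sizeof(double);
    a->data = (double *)(a + 1);   /* buffer starts right after the header */
    return a;
}

int main(void)
{
    toy_array *a = toy_array_new(2);
    a->data[0] = 1.0;
    a->data[1] = 2.0;
    printf("%g %g\n", a->data[0], a->data[1]);
    free(a);   /* header and data go away together */
    return 0;
}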

>> magnitude more time in inefficient loop selection and unnecessary writes to the FP control word?
> Loop selection contributes around 2-3% of the time. I implemented a cache with PyThreadState_GetDict(), but it didn't help.
> Even generating a prepopulated dict/list in code_generator/generate_umath.py is not helping.


> Here is the distribution of time for the addition operation. All the memory-related and BuildValue operations cost more than 7% each; the remaining loop-related ones are around 2-3%:
>   • PyUFunc_AdditionTypeResolver (7.6%)
>     • SimpleBinaryOperationTypeResolver (6.2%)
>   • execute_legacy_ufunc_loop (20.7%)
>     • trivial_three_operand_loop (8.6%), this will be around 3.4% when PR #3521 gets merged
>     • PyArray_NewFromDescr (7.3%)
>     • PyUFunc_DefaultLegacyInnerLoopSelector (2.5%)
>   • PyUFunc_GetPyValues (12.0%)
>     • _extract_pyvals (9.2%)
>   • PyArray_Return (14.3%)

Hmm, you prodded me into running those numbers again to see :-)

At http://www.arinkverma.in/2013/06/finding-bottleneck-in-pythonnumpy.html you say that you're using a Python compiled with --with-pydebug. Is this true? If so then stop! (A quick way to check: sys.gettotalrefcount only exists on --with-pydebug builds.) You want numpy compiled with generic debugging information ("-g" on gcc), and maybe it helps to have Python compiled with "-g" as well. But --with-pydebug goes much further -- it actually changes the Python interpreter in many ways to add lots of expensive self-checks. On my machine simple operations like "[]" (allocating a list) or "1.0 + 1.0" go about 4x slower when I use Ubuntu's python-dbg package (which is compiled with --with-pydebug). You can't trust speed measurements from a --with-pydebug build.

Anyway, I'm using 64-bit python2.7 from Ubuntu's repo, self-compiled numpy master, with this measurement code:

import ctypes
profiler = ctypes.CDLL("libprofiler.so.0")  # gperftools' CPU profiler, driven via ctypes
def loop(n):
    import numpy as np
    print "Numpy:", np.__version__
    x = np.asarray([1.0, 2.0])
    for i in xrange(n):
        x + x  # the operation under test: a tiny float64 + float64 array add
profiler.ProfilerStart("/tmp/master-array-float64-add.prof")
loop(10000000)
profiler.ProfilerStop()

Graph attached.
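(In case anyone wants to reproduce this: the .prof file is written by gperftools' CPU profiler, and the call graph can be rendered from it with something along the lines of google-pprof --pdf /usr/bin/python /tmp/master-array-float64-add.prof -- the same google-pprof tool I use with --disasm below.)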

Notice:
- because my benchmark has a 2-element array instead of a scalar array, the special-case scalar return logic (PyArray_Return etc.) disappears. This makes all percentages a bit higher in my graph, because the operation is overall faster.

- PyArray_NewFromDescr does indeed take 11.6% of the time, but it's not clear why. Half that time is directly inside PyArray_NewFromDescr, not in any sub-calls to malloc-related functions. Also, you see a lot more time in array_alloc than I do, which may be caused by --with-pydebug.

Taking a closer look with google-pprof --disasm=PyArray_NewFromDescr (also attached), it looks like the major cost here is, bizarrely enough, the calculation of the array size?! Out of 338 cumulative samples in this function, I count 175 that are associated with various div/mul instructions, while all the mallocs together take only 164 (= 5.6% of total time).

This is pretty bizarre for a bunch of 1-dimensional 2-element arrays!?

- PyUFunc_AdditionTypeResolver takes 10.9% of the time, and PyUFunc_DefaultLegacyInnerLoopSelector takes another 4.2% of the time, and this is pretty absurd considering that we're talking about locating the float64 + float64 loop, which should not require any complicated logic. This should be like 0.1% or something. I'm not surprised that PyThreadState_GetDict() didn't help -- doing dict lookups was probably more expensive than the thing you replaced! But some sort of simple table lookup scheme that reduces loop lookup to chasing a few pointers should be totally doable; see the sketch below.
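As a rough illustration of what I mean by "chasing a few pointers" -- every name and number here is invented, and numpy's real dispatch would still have to cooperate with the type resolution machinery -- something shaped like this:

#include <stdio.h>

#define NTYPES 24   /* e.g. the number of builtin dtypes */

/* invented inner-loop signature, for illustration only */
typedef void (*inner_loop)(double *out, const double *a,
                           const double *b, int n);

static void add_float64_loop(double *out, const double *a,
                             const double *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* flat table indexed by the two input type numbers, filled once at
 * module import; lookup is a bounds check plus one indexed load */
static inner_loop add_loop_table[NTYPES][NTYPES];

static inner_loop lookup_add_loop(int t1, int t2)
{
    if (t1 < 0 || t1 >= NTYPES || t2 < 0 || t2 >= NTYPES)
        return NULL;
    return add_loop_table[t1][t2];
}

int main(void)
{
    int FLOAT64 = 12;   /* made-up type number */
    double a[2] = {1.0, 2.0}, b[2] = {1.0, 2.0}, out[2];
    add_loop_table[FLOAT64][FLOAT64] = add_float64_loop;
    inner_loop f = lookup_add_loop(FLOAT64, FLOAT64);
    f(out, a, b, 2);
    printf("%g %g\n", out[0], out[1]);
    return 0;
}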

- We're spending 13.6% of the time in PyUFunc_getfperr. I'm pretty sure that a lot of this is totally wasted time, because we implement both 'set' and 'clear' operations as 'set+clear', making them twice as costly as necessary.

(Eventually it would be even better if we could disable this logic entirely for integer arrays, and for when the user has turned off fp error reporting. But neither of these would help for this simple float+float benchmark.)
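To make the set+clear point concrete: numpy actually uses its own platform-specific macros for this, but in portable C99 fenv.h terms the two operations are separable, so a check need not pay for a clear:

#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double x = 1e308;

    feclearexcept(FE_ALL_EXCEPT);   /* 'clear', on its own */
    x = x * 10.0;                   /* provoke an overflow */

    /* 'check', on its own: fetestexcept only reads the status word,
     * it does not clear it */
    if (fetestexcept(FE_OVERFLOW))
        printf("overflow flag is set\n");
    return 0;
}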

- _extract_pyvals and PyUFunc_GetPyValues (not sure why they aren't linked in my graph, but they seem to be the same code) together use >11% of time. This is also completely silly -- all this time is spent on doing elaborate stuff to look up entries in a python dict, extract them, and convert them into, like, some C-level bitmasks. And then doing that again and again on every operation. Instead we should convert this stuff to C values once, when they're set in the first place, and stash those C values directly into a thread-local variable. See PyThread_*_key in pythread.h for a raw TLS implementation that's always available (and which is what PyThreadState_GetDict() is built on top of). The documentation is in comments in Python/thread.c in the Python source distribution.
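For the record, the raw-TLS version of that would look something like this (the struct and function names are invented; the point is just that the per-operation fetch becomes one key lookup with no dict machinery):

#include <Python.h>
#include <pythread.h>

/* invented struct: the already-converted C-level state that
 * _extract_pyvals currently re-derives from a dict on every call */
typedef struct {
    int bufsize;
    int errmask;
    PyObject *errobj;
} ufunc_pyvals;

static int pyvals_key = -1;   /* created once, e.g. at module init */

static void stash_pyvals(ufunc_pyvals *vals)
{
    if (pyvals_key == -1)
        pyvals_key = PyThread_create_key();
    /* delete first: on some builds PyThread_set_key_value is a no-op
     * if the key already has a value for this thread */
    PyThread_delete_key_value(pyvals_key);
    PyThread_set_key_value(pyvals_key, vals);
}

static ufunc_pyvals *fetch_pyvals(void)
{
    /* the per-operation fast path: one TLS lookup, no dict traversal,
     * no conversion from Python objects */
    if (pyvals_key == -1)
        return NULL;
    return (ufunc_pyvals *)PyThread_get_key_value(pyvals_key);
}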

-n