Re: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array

16 Jul 2013

      On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma  wrote:
...
...
Each ndarray does two mallocs, for the obj and buffer. These could be
combined into 1 - just allocate the total size and do some pointer
arithmetic, then set OWNDATA to false.
So, that two mallocs has been mentioned in project introduction. I got
that wrong.
On further thought/reading the code, it appears to be more complicated than
that, actually.

It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc: 1
for the array object itself, and one for the shapes + strides. And, one
call to regular-old malloc: for the data buffer.

(Mysteriously, shapes + strides together have 2*ndim elements, but to hold
them we allocate a memory region sized to hold 3*ndim elements. I'm not
sure why.)

And contrary to what I said earlier, this is about as optimized as it can
be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc,
because the shapes+strides may need to be resized without affecting the
much larger data area. But it's tempting to allocate the array object and
the data buffer in a single memory region, like I suggested earlier. And
this would ALMOST work. But, it turns out there is code out there which
assumes (whether wisely or not) that you can swap around which data buffer
a given PyArrayObject refers to (hi Theano!). And supporting this means
that data buffers and PyArrayObjects need to be in separate memory regions.
...
magnitude more time in inefficient loop selection and unnecessary writes
to the FP control word?
loop selection, contribute around 2~3% in time. I implemented cache with PyThreadState_GetDict()
but it didnt help.
Even generating prepopulated dict/list in code_generator/generate_umath.py is
not helping,
Here, it the distribution of time, on addition operations. All memory
related and BuildValue operations cost more than 7%, rest looping ones are
around 2-3%:
- PyUFunc_AddititonTypeResolver(7.6%)
   - *SimpleBinaryOperationTypeResolver(6.2%)*
- *execute_legacy_ufunc_loop(20.7%)*
   - trivial_three_operand_loop(8.6%)  ,this will be around 3.4% when pr #
      3521 https://github.com/numpy/numpy/pull/3521 get merged
      - *PYArray_NewFromDescr(7.3%)*
      - PyUFunc_DefaultLegacyInnerLoopSelector(2.5%)
- PyUFunc_GetPyValues(12.0%)
   - *_extract_pyvals(9.2%)*
   - *PyArray_Return(14.3%)*
Hmm, you prodded me into running those numbers again to see :-)
At http://www.arinkverma.in/2013/06/finding-bottleneck-in-pythonnumpy.htmlyou
say that you're using a Python compiled with --with-pydebug. Is this
true? If so then stop! You want numpy compiled with generic debugging
information ("-g" on gcc), and maybe it helps to have Python compiled with
"-g" as well. But --with-pydebug goes much further -- it actually changes
the Python interpreter in many ways to add lots of expensive self-checks.
On my machine simple operations like "[]" (allocate a list) or "1.0 + 1.0"
go about 4x slower when I use Ubuntu's python-dbg package (which is
compiled with --with-pydebug). You can't trust speed measurements you get
from a --with-pydebug build.

Anyway, I'm using 64-bit python2.7 from Ubuntu's repo, self-compiled numpy
master, with this measurement code:

import ctypes
profiler = ctypes.CDLL("libprofiler.so.0")
def loop(n):
    import numpy as np
    print "Numpy:", np.__version__
    x = np.asarray([1.0, 2.0])
    for i in xrange(n):
        x + x
profiler.ProfilerStart("/tmp/master-array-float64-add.prof")
loop(10000000)
profiler.ProfilerStop()

Graph attached.

Notice:
- because my benchmark has a 2-element array instead of a scalar array, the
special-case scalar return logic (PyArray_Return etc.) disappears. This
makes all percentages a bit higher in my graph, because the operation is
overall faster.

- PyArray_NewFromDescr does indeed take 11.6% of the time, but it's not
clear why. Half that time is directly inside PyArray_NewFromDescr, not in
any sub-calls to malloc-related functions. Also, you see a lot more time in
array_alloc than I do, which may be caused by --with-pydebug.

Taking a closer look with google-pprof --disasm=PyArray_NewFromDescr (also
attached), it looks like the major cost here is, bizarrely enough, the
calculation of the array size?! Out of 338 cumulative samples in this
function, I count 175 that are associated with various div/mul
instructions, while all the mallocs together take only 164 (= 5.6% of total
time).

This is pretty bizarre for a bunch of 1-dimensional 2-element arrays!?

- PyUFunc_AdditionTypeResolver takes 10.9% of the time, and
PyUFunc_DefaultLegacyInnerLoopSelector takes another 4.2% of the time, and
this pretty absurd considering that we're talking about locating the
float64 + float64 loop, which should not require any complicated logic.
This should be like 0.1% or something. I'm not surprised that
PyThreadState_GetDict() doesn't help -- doing dict lookups was probably was
more expensive than the thing you replaced! But some sort of simple table
lookup scheme that reduces loop lookup to chasing a few pointers should be
totally doable.

- We're spending 13.6% of the time in PyUFunc_getfperr. I'm pretty sure
that a lot of this is totally wasted time, because we implement both 'set'
and 'clear' operations as 'set+clear', making them twice as costly as
necessary.

(Eventually it would be even better if we could disable this logic entirely
for integer arrays, and for when the user has turned off fp error
reporting. But neither of these would help for this simple float+float
benchmark.)

- _extract_pyvals and PyUFunc_GetPyValues (not sure why they aren't linked
in my graph, but they seem to be the same code) together use >11% of time.
This is also completely silly -- all this time is spent on doing elaborate
stuff to look up entries in a python dict, extract them, and convert them
into, like, some C level bitmasks. And then doing that again and again on
every operation. Instead we should convert this stuff to a C values once,
when they're set in the first place, and stash those C values directly into
a thread-local variable. See PyThread_*_key in pythread.h for a raw TLS
implementation that's always available (and which is what
PyThreadState_GetDict() is built on top of). The documentation is in the
Python source distribution in comments in Python/thread.c.

-n

Re: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array

Nathaniel Smith