[Numpy-discussion] Memory allocation cleanup

Fri Jan 10 11:03:11 EST 2014

On Fri, Jan 10, 2014 at 9:18 AM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On Fri, Jan 10, 2014 at 3:48 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> Also, none of the Py* interfaces implement calloc(), which is annoying
>> because it messes up our new optimization of using calloc() for
>> np.zeros. [...]
>
>
> Another thing that is not directly implemented in Python is aligned
> allocation. This is going to get increasingly important with the advent
> of heavily vectorized x86 CPUs (e.g. AVX512 is rolling out now) and the C
> malloc being optimized for the oldish SSE (16 bytes). I want to change the
> array buffer allocation to make use of posix_memalign and C11 aligned_alloc
> if available to avoid some penalties when loading from non 32 byte aligned
> buffers. I could imagine it might also help coprocessors and gpus to have
> higher alignments, but I'm not very familiar with that type of hardware.
> The allocator used by Python 3.4 is pluggable, so we could implement our
> special allocators with the new API, but only when 3.4 is more widespread.
>
> For this reason and missing calloc I don't think we should use the Python
> API for data buffers just yet. Any benefits are relatively small anyway.

It really would be nice if our data allocations were all visible
to the tracemalloc library though, somehow. And I doubt we want to
patch *all* Python allocations to go through posix_memalign, both
because this is rather intrusive and because it would break python -X
tracemalloc.
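
To make "visible to tracemalloc" concrete, here is a minimal sketch
using only the standard library (Python 3.4+). The bytearray is just a
stand-in for an array data buffer, and traced_growth is a hypothetical
helper name, not anything in numpy. Memory that goes through Python's
own allocators shows up in the traced total; a buffer obtained directly
from C malloc() or posix_memalign() would not.

```python
import tracemalloc


def traced_growth(nbytes):
    """Return how much the traced total grows while allocating nbytes.

    Hypothetical helper for illustration only. The bytearray goes
    through Python's allocators, so tracemalloc can see it.
    """
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    buf = bytearray(nbytes)  # stand-in for an array data buffer
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del buf
    return after - before


if __name__ == "__main__":
    # The growth is at least the requested size (plus object overhead).
    print(traced_growth(1000000) >= 1000000)
```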

How certain are we that we want to switch to aligned allocators in the
future? If we don't, then maybe it makes sense to ask python-dev for a
calloc interface; but if we do, then I doubt we can convince them to
add aligned allocation interfaces, and we'll need to ask for something
else (maybe a "null" allocator, which just notifies the python memory
tracking machinery that we allocated something ourselves?).
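
For reference, when the underlying allocator has no alignment
interface, the usual fallback is to over-allocate and then offset into
the block. The following is only a rough sketch of that idea in Python
with ctypes, not what numpy does or would do; aligned_buffer is a
hypothetical name and the 32 byte default is an assumption taken from
the AVX case discussed above.

```python
import ctypes


def aligned_buffer(nbytes, align=32):
    """Return a ctypes char array of nbytes whose address is a multiple
    of align. Hypothetical helper: over-allocate, then offset.
    """
    if align <= 0 or align & (align - 1):
        raise ValueError("align must be a positive power of two")
    # Worst case we have to skip align - 1 bytes to reach a boundary.
    raw = ctypes.create_string_buffer(nbytes + align - 1)
    offset = (-ctypes.addressof(raw)) % align
    # from_buffer shares raw's memory and keeps raw alive.
    return (ctypes.c_char * nbytes).from_buffer(raw, offset)


if __name__ == "__main__":
    buf = aligned_buffer(1000, align=32)
    print(ctypes.addressof(buf) % 32)  # 0
```

Note that the wasted bytes and the offset bookkeeping are exactly the
kind of thing an allocator-level interface would hide from us.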

It's not obvious to me why aligning data buffers is useful - can you
elaborate? There's no code simplification, because we always have to
handle the unaligned case anyway with the standard unaligned
startup/cleanup loops. And intuitively, given the existence of such
loops, alignment shouldn't matter much in practice, since the most
that shifting alignment can do is change the number of elements that
need to be handled by such loops by (SIMD alignment value / element
size). For doubles, in a buffer that has 16 byte alignment but not 32
byte alignment, this means that in the worst case we end up doing 4
unnecessary non-SIMD operations. And surely that only matters for very
small arrays (for large arrays such constant overhead will amortize
out), but for small arrays SIMD doesn't help much anyway? Probably I'm
missing something, because you actually know something about SIMD and
I'm just hand-waving from first principles :-). But it'd be nice to
understand the reasoning for why/whether alignment really helps in the
numpy context.
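
To spell out the arithmetic behind that estimate, here is a small
sketch that splits n elements into the scalar startup loop, the SIMD
body, and the scalar cleanup loop. It assumes the vector width equals
the alignment (32 bytes, as for AVX) and that the buffer is at least
element aligned; split_loops and its parameters are hypothetical names
for illustration, not numpy internals.

```python
def split_loops(address, n, itemsize=8, simd_align=32):
    """Return (head, body, tail) element counts for a buffer of n
    elements starting at the given byte address.

    head: scalar elements before the first simd_align boundary
    body: elements processed in full SIMD vectors
    tail: scalar elements left over at the end
    Assumes address is a multiple of itemsize.
    """
    if address % itemsize:
        raise ValueError("buffer is not element aligned")
    lanes = simd_align // itemsize
    head = min(n, ((-address) % simd_align) // itemsize)
    body = (n - head) // lanes * lanes
    tail = n - head - body
    return head, body, tail


if __name__ == "__main__":
    # 1000 doubles, 16 byte aligned but not 32 byte aligned:
    print(split_loops(16, 1000))  # (2, 996, 2)
    # Same array, 32 byte aligned:
    print(split_loops(0, 1000))   # (0, 1000, 0)
```

So for 1000 doubles the misaligned case costs 4 scalar operations that
the aligned case avoids, independent of the array length.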

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org