[Python-Dev] [numpy wishlist] PyMem_*Calloc

Thu Apr 17 18:12:21 CEST 2014

On Wed, Apr 16, 2014 at 7:35 PM, Victor Stinner
<victor.stinner at gmail.com> wrote:
> Hi,
>
> 2014-04-16 7:51 GMT-04:00 Julian Taylor <jtaylor.debian at googlemail.com>:
>> In NumPy what we want is the tracing, not the exchangeable allocators.
>
> Did you read the PEP 445? Using the new malloc API, in fact you can
> have both: install new allocators and set up hooks on allocators.
> http://legacy.python.org/dev/peps/pep-0445/

The context here is that there's been some followup discussion on the
numpy list about whether there are cases where we need even more
exotic memory allocators than calloc(), and what to do about it if so.

(Thread: http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069935.html
)

One case that has come up is when efficient use of SIMD instructions
requires better-than-default alignment (e.g. malloc() usually gives
something like 8 or 16 byte alignment, but if you're using an
instruction that operates on 32 bytes at once you might need your
array to have 32 byte alignment). Most (all?) OSes provide an extended
version of malloc that allows one to request more alignment
(posix_memalign on POSIX, _aligned_malloc on Windows), and C11
standardizes this as aligned_alloc. An important feature of the POSIX
and C11 functions is that they allocate from the same heap that
malloc() does, i.e., when done with the aligned memory you call
free() -- there's no such thing as aligned_free(). (Windows is the
exception: memory from _aligned_malloc has to be released with
_aligned_free.) Either way, if your program uses these functions then
swapping out malloc/free without also swapping out the aligned
allocator will produce undesirable results: memory obtained from one
allocator ends up being freed by another.
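
To make the free() pairing concrete, here is a minimal sketch of an
aligned allocation on POSIX. The helper names alloc_aligned and
is_aligned are made up for illustration; only posix_memalign and
free() are real library calls. I use posix_memalign rather than C11
aligned_alloc simply because it is more widely available.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate `size` bytes aligned to `alignment`.
 * posix_memalign requires `alignment` to be a power of two and a
 * multiple of sizeof(void *). Returns NULL on failure. */
void *alloc_aligned(size_t alignment, size_t size)
{
    void *p = NULL;
    /* posix_memalign returns 0 on success, an error number otherwise. */
    if (posix_memalign(&p, alignment, size) != 0) {
        return NULL;
    }
    return p;
}

/* Hypothetical helper: does `p` honour the requested alignment? */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Typical use would be p = alloc_aligned(32, n) followed later by a
plain free(p). That final free() is exactly the call that goes wrong
if malloc/free have been replaced but the aligned allocator has not.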

Numpy does not currently use aligned allocation, and it's not clear
how important it is. On older x86 it matters, but not so much on
current CPUs; when the next round of x86 SIMD instructions is
released next year it might matter again; apparently on popular
IBM supercomputers it matters (but less on newer versions)[1,2]; and
who knows what will happen with ARM. It's a bit of a mess. But if
we're messing about with APIs it seems worth thinking about.

[1] http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069965.html
[2] http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069967.html

A second possible use case is:

>> my_hugetlb_alloc(size)
>>     p = mmap('hugepagefs', ..., MAP_HUGETLB);
>>     PyMem_Register_Alloc(p, size, __func__, __line__);
>>     return p
>>
>> my_hugetlb_free(p);
>>     PyMem_Register_Free(p, __func__, __line__);
>>     munmap(p, ...);
>
> This is exactly how tracemalloc works. The advantage of the PEP 445 is
> that you have a null overhead when tracemalloc is disabled. There is
> no need to check if a trace function is present or not.

I think the key thing about this example is that you would *never*
want to use MAP_HUGETLB as a generic replacement for malloc(). Huge
pages can have all kinds of weird quirky limitations, and are
certainly unsuited for small allocations. BUT they can provide huge
speed wins if used for certain specific allocations in certain
programs. (In case anyone needs a reminder what "huge pages" even are:
http://lwn.net/Articles/374424/)

If I wrote a Python library to make it easy to use huge pages with
numpy, then I might well want the allocations I was making to be
visible to tracemalloc, even though they would not be going through
malloc/free.
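
Here is a rough sketch of what such a library's allocator might look
like in C, fleshing out the quoted pseudocode. Everything named
trace_* below is a hypothetical stand-in for whatever hook CPython
might offer (it is not an existing API), the 2 MiB huge page size is
an assumption that holds on common x86-64 Linux setups but not
everywhere, and the fallback to ordinary pages is my own addition so
the code still works on machines with no huge pages configured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed huge page size: 2 MiB (common on x86-64 Linux, not universal). */
#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024)

/* Hypothetical tracing hooks: they only keep a running byte count. */
static size_t traced_bytes = 0;
void trace_register_alloc(void *p, size_t size) { (void)p; traced_bytes += size; }
void trace_register_free(void *p, size_t size)  { (void)p; traced_bytes -= size; }
size_t trace_current_bytes(void) { return traced_bytes; }

/* Huge page mappings must be sized in whole huge pages. */
size_t round_up_to_huge_page(size_t size)
{
    return (size + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
}

void *my_hugetlb_alloc(size_t size)
{
    size_t len = round_up_to_huge_page(size);
    void *p = MAP_FAILED;
#ifdef MAP_HUGETLB
    /* Try an anonymous huge page mapping first (Linux only). */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    if (p == MAP_FAILED) {
        /* No huge pages available: fall back to ordinary pages. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (p == MAP_FAILED) {
        return NULL;
    }
    /* Tell the (hypothetical) tracer about memory malloc never saw. */
    trace_register_alloc(p, len);
    return p;
}

void my_hugetlb_free(void *p, size_t size)
{
    size_t len = round_up_to_huge_page(size);
    trace_register_free(p, len);
    munmap(p, len);
}
```

The point is that neither function ever touches malloc/free, so a
tracer that only hooks the malloc family would never see this memory
unless the library reports it explicitly.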

(For that matter -- should calls to mmap.mmap be calling some
tracemalloc hook in general? There are lots of cases where mmap is
really doing memory allocation -- it's very useful for shared memory
and stuff too.)

---

My current impression is something like:

- From the bug report discussion it sounds like calloc() is useful
even in core Python, so it makes sense to go ahead with that
regardless.
- Now that aligned_alloc has been standardized, it might make sense to
add it to the PyMemAllocator struct too.
- And it might also make sense to have an API by which a Python
library can say to tracemalloc: "hey FYI I just allocated something
using my favorite weird exotic method", like in the huge pages
example. This is a fully generic mechanism, so it could act as a kind
of "safety valve" for future weirdnesses.
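
To illustrate the second bullet, here is a sketch of an allocator
table with an aligned_alloc slot, plus a hook that wraps it in the
PEP 445 style (the hook keeps the previous allocator and forwards to
it). ExampleAllocator, CountingHook and the make_* functions are all
hypothetical names of mine, loosely modelled on PyMemAllocator; none
of this is an actual CPython API.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocator table: PyMemAllocator-like, plus aligned_alloc. */
typedef struct {
    void *ctx;
    void *(*malloc)(void *ctx, size_t size);
    void *(*aligned_alloc)(void *ctx, size_t alignment, size_t size);
    void (*free)(void *ctx, void *ptr);
} ExampleAllocator;

/* System-backed implementations; all three share one heap. */
static void *sys_malloc(void *ctx, size_t size)
{
    (void)ctx;
    return malloc(size);
}

static void *sys_aligned_alloc(void *ctx, size_t alignment, size_t size)
{
    void *p = NULL;
    (void)ctx;
    return posix_memalign(&p, alignment, size) == 0 ? p : NULL;
}

static void sys_free(void *ctx, void *ptr)
{
    (void)ctx;
    free(ptr);
}

ExampleAllocator make_system_allocator(void)
{
    ExampleAllocator a = { NULL, sys_malloc, sys_aligned_alloc, sys_free };
    return a;
}

/* Hook state: the wrapped allocator plus a count of live blocks. */
typedef struct {
    ExampleAllocator inner;
    size_t live_blocks;
} CountingHook;

static void *hook_malloc(void *ctx, size_t size)
{
    CountingHook *h = ctx;
    void *p = h->inner.malloc(h->inner.ctx, size);
    if (p != NULL) h->live_blocks++;
    return p;
}

static void *hook_aligned_alloc(void *ctx, size_t alignment, size_t size)
{
    CountingHook *h = ctx;
    void *p = h->inner.aligned_alloc(h->inner.ctx, alignment, size);
    if (p != NULL) h->live_blocks++;
    return p;
}

static void hook_free(void *ctx, void *ptr)
{
    CountingHook *h = ctx;
    if (ptr != NULL) h->live_blocks--;
    h->inner.free(h->inner.ctx, ptr);
}

/* Wrap `inner` so that every slot, including aligned_alloc, is traced. */
ExampleAllocator make_counting_allocator(CountingHook *hook, ExampleAllocator inner)
{
    ExampleAllocator a = { hook, hook_malloc, hook_aligned_alloc, hook_free };
    hook->inner = inner;
    hook->live_blocks = 0;
    return a;
}
```

Because aligned_alloc lives in the same table as malloc and free, a
hook or replacement allocator gets installed for all of them at once,
which avoids the mismatched-heap problem described earlier.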

All numpy *needs* to support its current and immediately foreseeable
usage is calloc(). But I'm a bit nervous about getting trapped -- if
the PyMem_* machinery implements calloc(), and we switch to using it
and advertise tracemalloc support to our users, and then later it
turns out that we need aligned_alloc or similar, then we'll be stuck
unless and until we can get at least one of these other changes into
CPython upstream, and that will suck for all of us.

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org