[Python-Dev] [numpy wishlist] PyMem_*Calloc

Benjamin Peterson benjamin at python.org
Mon Apr 14 17:36:55 CEST 2014


On Sun, Apr 13, 2014, at 22:39, Nathaniel Smith wrote:
> Hi all,
> 
> The new tracemalloc infrastructure in python 3.4 is super-interesting
> to numerical folks, because we really like memory profiling. Numerical
> programs allocate a lot of memory, and sometimes it's not clear which
> operations allocate memory (some numpy operations return views of the
> original array without allocating anything; others return copies). So
> people actually use memory tracking tools[1], even though
> traditionally these have been pretty hacky (e.g., just checking RSS
> before and after each line is executed), and numpy has even grown its
> own little tracemalloc-like infrastructure [2], but it only works for
> numpy data.
> 
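(For anyone who hasn't played with it yet, here's a minimal sketch of the kind of per-line tracking tracemalloc gives you -- sizes are approximate and the exact statistic output may vary across versions:)

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Allocate ~100 KB through Python's (traced) allocators.
data = [bytes(1000) for _ in range(100)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Diff the snapshots: the list-comprehension line above should account
# for roughly 100 KB of the growth, attributed to its source line.
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```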
> BUT, we also really like calloc(). One of the basic array creation
> routines in numpy is numpy.zeros(), which returns an array full of --
> you guessed it -- zeros. For pretty much all the data types numpy
> supports, the value zero is represented by the bytestring consisting
> of all zeros. So numpy.zeros() usually uses calloc() to allocate its
> memory.
> 
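(As a quick aside, the all-zero-bytes claim is easy to check from pure Python with the struct module -- an illustration, not numpy's actual code:)

```python
import struct

# IEEE-754 floats and two's-complement integers both represent the value
# zero as the all-zero byte pattern, which is why calloc'd memory already
# reads back as an array of zeros for these data types.
assert struct.pack("<d", 0.0) == b"\x00" * 8   # float64
assert struct.pack("<i", 0) == b"\x00" * 4     # int32
assert struct.unpack("<d", b"\x00" * 8)[0] == 0.0
```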
> calloc() is more awesome than malloc()+memset() for two reasons.
> First, calloc() for larger allocations is usually implemented using
> clever VM tricks, so that it doesn't actually allocate any memory up
> front, it just creates a COW mapping of the system zero page and then
> does the actual allocation one page at a time as different entries are
> written to. This means that in the somewhat common case where you
> allocate a large array full of zeros, and then only set a few
> scattered entries to non-zero values, you can end up using much much
> less memory than otherwise. It's entirely possible for this to make
> the difference between being able to run an analysis and not.
> memset() forces the whole amount of RAM to be committed immediately.
> 
> Secondly, even if you *are* going to touch all the memory, then
> calloc() is still faster than malloc()+memset(). The reason is that
> for large allocations, malloc() usually does a calloc() no matter what
> -- when you get a new page from the kernel, the kernel has to make
> sure you can't see random bits of other processes' memory, so it
> unconditionally zeros out the page before you get to see it. calloc()
> knows this, so it doesn't bother zeroing it again. malloc()+memset(),
> by contrast, zeros the page twice, producing twice as much memory
> traffic, which is huge.
> 
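(A rough way to poke at libc's calloc from Python is ctypes. This sketch assumes a Unix-like libc that ctypes.util.find_library can locate, and it only shows that calloc's memory reads back zeroed -- the COW/page-commit behavior itself isn't directly observable this way:)

```python
import ctypes
import ctypes.util

# Assumes a Unix-like system where find_library can locate the C library.
# On glibc, an allocation this large is serviced by mmap, so calloc can
# rely on kernel-zeroed pages instead of touching every byte itself.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

n = 16 * 1024 * 1024  # 16 MiB
p = libc.calloc(n, 1)
assert p, "calloc returned NULL"

# The memory reads back as zeros even though no memset pass ran over it.
view = (ctypes.c_char * n).from_address(p)
first_page = view[:4096]
libc.free(p)

print(first_page == b"\x00" * 4096)
```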
> SO, we'd like to route our allocations through PyMem_* in order to let
> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
> this would force us to give up on the calloc() optimizations.
> 
> The obvious solution is to add a PyMem_*Calloc to the API. Would this
> be possible? Unfortunately it would require adding a new field to the
> PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
> is exposed directly in the C API and passed by value:
>   https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
> (Too bad we didn't notice this a few months ago before 3.4 was
> released :-(.) I guess we could just rename the struct in 3.5, to
> force people to update their code. (I guess there aren't too many
> people who would have to update their code.)

Well, the allocator API is not part of the stable ABI, so we can change
it if we want.

> 
> Thoughts?

I think the request is completely reasonable.

