[Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

Nathaniel Smith njs at pobox.com
Tue Apr 15 12:39:11 EDT 2014


On Tue, Apr 15, 2014 at 4:08 PM, Julian Taylor <jtaylor.debian at googlemail.com> wrote:
> On Tue, Apr 15, 2014 at 3:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Tue, Apr 15, 2014 at 12:06 PM, Julian Taylor
>> <jtaylor.debian at googlemail.com> wrote:
>>>> Good news, though! python-dev is in favor of adding calloc() to the
>>>> core allocation interfaces, which will let numpy join the party. See
>>>> python-dev thread:
>>>> https://mail.python.org/pipermail/python-dev/2014-April/133985.html
>>>>
>>>> It would be especially nice if we could get this into 3.5, since it
>>>> seems likely that lots of numpy users will be switching to 3.5 when it
>>>> comes out, and having a good memory tracing infrastructure there
>>>> waiting for them will make it even more awesome.
>>>>
>>>> Anyone interested in picking this up?
>>>> http://bugs.python.org/issue21233
>>>
>>> Hi,
>>> I think it would be a better idea if, instead of API functions for one
>>> particular type of allocator, we got access to the Python hooks
>>> directly, so we can use whatever allocator we want.
>>
>> Unfortunately, that's not how the API works. The way that third-party
>> tracers register a 'hook' is by providing a new implementation of
>> malloc/free/etc. So there's no general way to say "please pretend to
>> have done a malloc".
>>
>> I guess we could potentially request the addition of
>> fake_malloc/fake_free functions.
>
> Unfortunately, looking at the PEP it seems you can have either a custom
> allocator or tracing, but not both (unless you do the tracing
> yourself).
> This seems like quite a limitation.

I don't think this is right - notice the PyMem_GetAllocator function, which
lets you grab the old allocator. This means you can write a tracing
"allocator" which just does its tracing and then delegates to the old
allocator. (And looking at _tracemalloc.c this does seem to be how it
works.) It does mean that any full allocator replacement has to be
installed before any tracing allocator is enabled, but that's okay: a
full replacement has to be installed *very* early in any case (before
any allocations have happened) and can never be removed, so this doesn't
seem so bad.
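
To make the delegation pattern concrete, here's a rough (untested)
sketch of a stacking tracer using the PEP 445 struct as it stands
today; error handling and the actual bookkeeping are elided:

    #include <Python.h>

    /* Keep the previous allocator around and call through to it after
       recording the event ourselves. */
    static PyMemAllocator old_raw;

    static void *
    tracing_malloc(void *ctx, size_t size)
    {
        void *ptr = old_raw.malloc(old_raw.ctx, size);
        /* record (ptr, size) in our own tracing structures here */
        return ptr;
    }

    static void *
    tracing_realloc(void *ctx, void *ptr, size_t new_size)
    {
        void *new_ptr = old_raw.realloc(old_raw.ctx, ptr, new_size);
        /* record that ptr was released and new_ptr allocated */
        return new_ptr;
    }

    static void
    tracing_free(void *ctx, void *ptr)
    {
        /* record that ptr was released */
        old_raw.free(old_raw.ctx, ptr);
    }

    static void
    install_tracer(void)
    {
        PyMemAllocator tracer = {NULL, tracing_malloc, tracing_realloc,
                                 tracing_free};
        PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &old_raw);
        PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &tracer);
    }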

OTOH I don't think they've really thought about the case of stacking
multiple tracing allocators. tracemalloc.stop() just unconditionally resets
the allocator to whatever it was when tracemalloc.start() was called, and
there are no guidelines on how to handle the lifetime of the ctx pointer. I'm
not sure these issues cause any problems in practice though.

> Maybe it would have been more flexible if instead python provided
> three functions:
>
> PyMem_RegisterAlloc(size);
> PyMem_RegisterReAlloc(size);
> PyMem_RegisterFree(size);
> + possibly nogil variants
> These functions call into registered tracing functions (registered
> e.g. by tracemalloc.start()) or do nothing.
>
> Our allocator (and Python's) then just always calls these functions and
> continues doing its stuff.

You'd need to add some void* arguments as well -- tracemalloc actually
tracks every allocation independently, so you can do things like ask "which
line of code was responsible for allocating the largest portion of the
memory that is still in use".

And unfortunately once you add these arguments the resulting signatures
don't quite match regular malloc/realloc/free (you have to pass a void*
into malloc instead of receiving one), so we can't just define a PYMEM_NULL
domain. (Or rather, we could, but then it would have to return an opaque
void* used only for memory tracking, and we'd have to keep track of this
alongside every allocation we did, and that would suck.)
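
To be concrete, the hooks would have to look something like this
(entirely hypothetical signatures, just to illustrate the mismatch):

    /* Hypothetical "fake allocation" hooks -- NOT a real CPython API.
       Unlike malloc(), the alloc hook has to *receive* the pointer as
       an argument, because the real allocation already happened
       somewhere else. */
    void PyMem_RegisterAlloc(void *ptr, size_t size);
    void PyMem_RegisterRealloc(void *old_ptr, void *new_ptr,
                               size_t new_size);
    void PyMem_RegisterFree(void *ptr);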

>>> This would allow us to, for example, use aligned memory allocators, which
>>> might be relevant for the new CPU instruction sets with up to 64-byte-wide
>>> registers.
>>
>> I think we might have had this conversation before, but I don't
>> remember how it went... did you have some explanation about how this
>> could matter in principle? We have to write code to handle unaligned
>> (or imperfectly aligned) arrays regardless, so aligned allocation
>> doesn't affect maintainability. And regarding speed, I can't see how
>> an extra instruction here or there could make a big difference on
>> small arrays, since the effect should be overwhelmed by interpreter
>> overhead and memory stalls (not much time for prefetch to come into
>> play on small arrays), but OTOH large arrays are usually page-aligned
>> in practice, and if not then any extra start-up overhead will be
>> amortized out by their size.
>
> Yes, we already had this conversation :)
> If you have two or more arrays not aligned the same way, you can only
> align one of them via peeling; the others will always have to be
> accessed unaligned.
> But it probably does not matter anymore with newer CPUs. I should
> probably just throw out my old Core 2, where it does :)

Oh right! Yes, that makes sense, sorry :-)
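
For the archives, here's a sketch of the peeling trick Julian is
describing, and of why it can only fix up one of the arrays:

    #include <stddef.h>
    #include <stdint.h>

    /* Peel scalar iterations until dst reaches a 64-byte boundary.
       src's offset relative to dst is fixed, so unless the two arrays
       started with the same misalignment, the src accesses in the main
       loop stay unaligned no matter how much we peel. */
    static void
    add_arrays(double *dst, const double *src, size_t n)
    {
        size_t i = 0;
        while (i < n && ((uintptr_t)(dst + i) % 64) != 0) {
            dst[i] += src[i];
            i++;
        }
        /* main (vectorizable) loop: dst is aligned here, src may not be */
        for (; i < n; i++)
            dst[i] += src[i];
    }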

On the one hand it would be nice to actually know whether posix_memalign is
important, before making API decisions on this basis. OTOH we've made it
this far without, and apparently the processors for which it might or might
not matter won't be out for some time, so we could revisit things then I
guess...
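
For reference, the kind of thing an aligned allocator would do is just
(sketch, assuming POSIX; note there is no aligned realloc, which is one
of the awkward bits):

    #include <stdlib.h>

    /* Allocate nbytes with 64-byte alignment, e.g. for 512-bit-wide
       vector registers.  The result must be released with plain
       free(); posix_memalign has no realloc counterpart. */
    static void *
    alloc_aligned64(size_t nbytes)
    {
        void *buf = NULL;
        if (posix_memalign(&buf, 64, nbytes) != 0)
            return NULL;
        return buf;
    }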

Anyone know how picky ARM NEON is about alignment?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org