[Python-Dev] obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

Sun Jun 2 05:37:06 EDT 2019

On Sun, 2 Jun 2019 00:56:52 -0500
Tim Peters <tim.peters at gmail.com> wrote:
> 
> But because O is only trying to deal with small (<= 512 bytes)
> requests, it can use a very fast method based on trivial address
> arithmetic to find the size of an allocated block by just reading it
> up from the start of the (4K) "pool" the address belongs to.  T can't
> do that - it appears to need to look up the address in a more
> elaborate radix tree, to find info recording the size of the block
> (which may be just about anything - no upper limit).

The interesting thing here is that in many situations, the size is
known up front when deallocating - it is simply not communicated to the
deallocator because the traditional free() API takes a sole pointer,
not a size.  But CPython could communicate that size easily if we
would like to change the deallocation API.  Then there's no bother
looking up the allocated size in sophisticated lookup structures.

I'll note that jemalloc provides such APIs:
http://jemalloc.net/jemalloc.3.html

"""The dallocx() function causes the memory referenced by ptr to be
made available for future allocations.

The sdallocx() function is an extension of dallocx() with a size
parameter to allow the caller to pass in the allocation size as an
optimization."""

Regards

Antoine.

> 
> > (well, of course, obmalloc doesn't have to worry about concurrent
> > scenarios, which explains some of the simplicity)  
> 
> Right, T has a different collection of free lists for each thread. so
> on each entry has to figure out which collection to use (and so
> doesn't need to lock).  That's not free.  O only has one collection,
> and relies on the GIL.
> 
> Against that, O burns cycles worrying about something else:  because
> it was controversial when it was new, O thought it was necessary to
> handle free/realloc calls even when passed addresses that had actually
> been obtained from the system malloc/realloc.  The T docs I saw said
> "don't do that - things will blow up in mysterious ways".
> 
> That's where O's excruciating "address_in_range()" logic comes from.
> While that's zippy and scales extremely well (it doesn't depend on how
> many objects/arenas/pools exist), it's not free, and is a significant
> part of the "fast path" expense for both allocation and deallocation.
> 
> It also limits us to a maximum pool size of 4K (to avoid possible
> segfaults when reading up memory that was actually obtained from the
> system malloc/realloc), and that's become increasingly painful:  on
> 64-bit boxes the bytes lost to pool headers increased, and O changed
> to handle requests up to 512 bytes instead of its original limit of
> 256.  O was intended to supply "a bunch" of  usable blocks per pool,
> not just a handful.  We "should" really at least double the pool and
> arena sizes now.
> 
> I don't think we need to cater anymore to careless code that mixes
> system memory calls with O calls (e.g., if an extension gets memory
> via `malloc()`, it's its responsibility to call `free()`), and if not
> then `address_in_range()` isn't really necessary anymore either, and
> then we could increase the pool size.  O would, however, need a new
> way to recognize when its version of malloc punted to the system
> malloc.
> 
> BTW, one more:  last I saw T never returns memory to "the system", but
> O does - indeed, the parent thread here was all about _enormous_ time
> waste due to that in O ;-)  That's not free either, but doesn't affect
> O's fast paths.