[Python-Dev] obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

Sun Jun 2 01:56:52 EDT 2019

[Antoine Pitrou, replying to Thomas Wouters]
> Interesting that a 20-year simple allocator (obmalloc) is able to do
> better than the sophisticated TCMalloc.

It's very hard to beat obmalloc (O) at what it does.  TCMalloc (T) is
actually very similar where they overlap, but has to be more complex
because it's trying to do more than O.

In either case, for small objects "the fast path" consists merely of
taking the first block of memory off a singly-linked size-segregated
free list.  For freeing, the fast path is just linking the block back
to the front of the appropriate free list.  What _could_ be faster?  A
"bump allocator" allocates faster (just increment a highwater mark),
but creates worlds of problems when freeing.

But because O is only trying to deal with small (<= 512 bytes)
requests, it can use a very fast method based on trivial address
arithmetic to find the size of an allocated block by just reading it
up from the start of the (4K) "pool" the address belongs to.  T can't
do that - it appears to need to look up the address in a more
elaborate radix tree, to find info recording the size of the block
(which may be just about anything - no upper limit).

> (well, of course, obmalloc doesn't have to worry about concurrent
> scenarios, which explains some of the simplicity)

Right, T has a different collection of free lists for each thread. so
on each entry has to figure out which collection to use (and so
doesn't need to lock).  That's not free.  O only has one collection,
and relies on the GIL.

Against that, O burns cycles worrying about something else:  because
it was controversial when it was new, O thought it was necessary to
handle free/realloc calls even when passed addresses that had actually
been obtained from the system malloc/realloc.  The T docs I saw said
"don't do that - things will blow up in mysterious ways".

That's where O's excruciating "address_in_range()" logic comes from.
While that's zippy and scales extremely well (it doesn't depend on how
many objects/arenas/pools exist), it's not free, and is a significant
part of the "fast path" expense for both allocation and deallocation.

It also limits us to a maximum pool size of 4K (to avoid possible
segfaults when reading up memory that was actually obtained from the
system malloc/realloc), and that's become increasingly painful:  on
64-bit boxes the bytes lost to pool headers increased, and O changed
to handle requests up to 512 bytes instead of its original limit of
256.  O was intended to supply "a bunch" of  usable blocks per pool,
not just a handful.  We "should" really at least double the pool and
arena sizes now.

I don't think we need to cater anymore to careless code that mixes
system memory calls with O calls (e.g., if an extension gets memory
via `malloc()`, it's its responsibility to call `free()`), and if not
then `address_in_range()` isn't really necessary anymore either, and
then we could increase the pool size.  O would, however, need a new
way to recognize when its version of malloc punted to the system
malloc.

BTW, one more:  last I saw T never returns memory to "the system", but
O does - indeed, the parent thread here was all about _enormous_ time
waste due to that in O ;-)  That's not free either, but doesn't affect
O's fast paths.