Have a big machine and spare time? Here's a possible Python bug.

There's a Stackoverflow report[1] I suspect is worth looking into, but it requires far more RAM (over 80GB) than I have. The OP whittled it down to a reasonably brief & straightforward pure Python 3 program. It builds a ternary search tree, with perhaps a billion nodes. The problem is that it "takes forever" for Python to reclaim the tree storage when it goes out of scope (the author waited at least hours).

Alas, the OP said it takes about 45 minutes to build the tree, and the problem goes away if the tree is materially smaller. So it takes a long time just to try once. With a tree about 10x smaller, for me it takes about 45 seconds for Python to reclaim the tree storage.

The tree is deep enough that the "trashcan" may be implicated, and the node objects are small enough that obmalloc carves them out of a (relatively) great many arenas. Those are two ways in which Python _may_ be to blame. The sheer number of objects involved may be provoking edge cases we normally never see.

But, for a start, it would be good to know if anyone else can actually reproduce the problem.

[1] https://stackoverflow.com/questions/56228799/python-hangs-indefinitely-tryin...

I have only 32GB mem, but AWS has larger memory machines!

Linux perf shows the bottleneck is here: https://github.com/python/cpython/blob/master/Objects/obmalloc.c#L1784-L1819

obmalloc sorts arenas by number of free pools. If there are many arenas and memory blocks are freed in random order, this sorting becomes O(N^2). That's too bad. I'm trying address order instead.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
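[To make the quadratic behavior concrete, here is a toy pure-Python model (names and numbers invented, not CPython code) of keeping the arena list sorted by free-pool count with a linear re-insertion scan. The comparison count grows roughly quadratically when blocks are freed in random order.]

```
# Toy model: each "free" bumps one arena's free-pool count by 1 and walks
# forward past arenas with a count <= the new count, mimicking the hot loop.
import random

def simulate(narenas, nfrees, seed=0):
    random.seed(seed)
    counts = [0] * narenas              # nfreepools per arena
    order = list(range(narenas))        # arena ids, kept sorted by counts[...]
    steps = 0
    for _ in range(nfrees):
        a = random.randrange(narenas)
        i = order.index(a)              # O(1) in the real doubly linked list
        counts[a] += 1
        j = i + 1
        while j < len(order) and counts[order[j]] <= counts[a]:
            j += 1                      # the comparison that dominates runtime
            steps += 1
        order.insert(j, a)
        del order[i]
    assert all(counts[order[k]] <= counts[order[k + 1]]
               for k in range(len(order) - 1))
    return steps

if __name__ == "__main__":
    for n in (500, 1_000, 2_000):
        # total comparisons roughly quadruple each time n doubles
        print(n, simulate(n, n * 4))
```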

On Thu, May 23, 2019 at 11:52 AM Inada Naoki <songofacandy@gmail.com> wrote:
I have only 32GB mem, but AWS has larger memory machines!
Be aware that on larger cloud vendor instances, the backing vCPUs and memory you get allocated might or might not be spread opaquely across multiple sockets and/or NUMA nodes of the backing hardware. Some of them have options where you can allocate raw hardware as well so you can opt to lock your process to run within just one NUMA node and ensure hardware locality for your performance debugging. I'm pointing this out in case you experience any mystery jitters in your benchmark results. -- Joni Orponen

1. perf shows 95% of CPU time is eaten by _PyObject_Free, not kernel space.
2. This loop is clearly hot: https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb9383...

I attached to the process with gdb and confirmed that many arenas have the same nfreepools. I believe this is not jitter caused by NUMA or anything else in the cloud.

-- Inada Naoki <songofacandy@gmail.com>

On 23May2019 0542, Inada Naoki wrote:
It's relatively easy to test replacing our custom allocators with the system ones, yes? Can we try those to see whether they have the same characteristic? Given the relative amount of investment over the last 19 years [1], I wouldn't be surprised if most system ones are at least as good for our needs now. Certainly Windows HeapAlloc has had serious improvements in that time to help with fragmentation and small allocations. Cheers, Steve [1]: https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb9383...

For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:

$ local/bin/python3 t1.py  # default
1138.1123778309993 -- end train, start del
688.7927911250008 -- end

$ arena-1m/bin/python3 t1.py  # Changed ARENA_SIZE to 1MiB
1085.3363994170013 -- end train, start del
84.57135540099989 -- end

$ PYTHONMALLOC=malloc local/bin/python3 t1.py
1157.4882792789995 -- end train, start del
27.919834706000074 -- end

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 PYTHONMALLOC=malloc local/bin/python3 t1.py
1098.4383037820007 -- end train, start del
117.93938426599925 -- end

In this case, glibc malloc is the fastest. glibc malloc is known to be weak regarding fragmentation, but its algorithms for avoiding fragmentation are just overhead in this script.

Regards,
-- Inada Naoki <songofacandy@gmail.com>

On Fri, May 24, 2019 at 09:41, Inada Naoki <songofacandy@gmail.com> wrote:
688 => 84 looks like an interesting speedup. Would it be technically possible to make ARENA_SIZE configurable at runtime? Using the new PEP 587 preinitialization, it shouldn't be too hard to add a new command line option and/or an environment variable to tune the memory allocator: https://www.python.org/dev/peps/pep-0587/#preinitialization-with-pypreconfig

Victor
-- Night gathers, and now my watch begins. It shall not end until my death.

[Inada Naoki <songofacandy@gmail.com>]
For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:
I'm unclear on what "nodes" means. If you mean you changed 27M to 10M in this line: for token in random_strings(27_000_000): that's fine, but there are about 40 times more than that `Node` objects created by the program. So if you did change that to 10_000_000, you'd have about 400M `Node` objects, consuming about 80x that many bytes of RAM (or about 32GB).
Nice! I assume these are seconds. On Stackoverflow, the OP reported that boosting ARENA_SIZE the same way cut deallocation time in his original program to about 13 minutes.
While the OP reported, for the original program:

"""
With PYTHONMALLOC=malloc, insertion takes nearly twice as long, but deallocation only takes 2 minutes!
"""

The "nearly twice as long" for building the tree is in sharp contrast to what you saw, but there's about 3x as much of everything in the original program, and, e.g., it's certainly possible malloc is tripping over fragmentation there that obmalloc avoids.
I can't say. I'll note that the approach I very briefly sketched before (restructure the list of arenas into multiple lists partitioned by number of free pools) "should make" obmalloc competitive with malloc here (it would eliminate all searches, except perhaps (depending on how gonzo the code is) a rare need to search for "the first" non-empty list).

[Tim]
But it's also intrusive, breaking up a simple linked list into a multi-headed beast. That affects all code using it, not just the parts mutating it.

So instead I suggest leaving `usable_arenas` entirely alone, but add a vector of "search fingers", say,

static struct arenaobject* nfp2lasta[MAX_NUMBER_OF_FREE_POOLS_POSSIBLE + 1];

mapping a count of free pools to (a pointer to) the last arena object in usable_arenas with that number of free pools. Then the search loop in pymalloc_free() can be replaced with a single array lookup: it leaps directly to where the search loop ends now.

For example, if we free a pool in an arena that had 23 free pools, then the arena object's count of free pools has to be increased to 24, and the arena object unlinked from its current position in usable_arenas and inserted immediately after nfp2lasta[23]. Note that usable_arenas is a doubly linked list, so there's no _need_ at all to iterate over it. Each node in the list knows what's immediately before and after it. And since a free pool count can only increase by 1, it's always correct to move the arena immediately after the last arena with the same original free pool count.

Key invariants:

1. nfp2lasta[n] is NULL if and only if no arena in usable_arenas has nfreepools == n.

2. nfp2lasta[pa->nfreepools] == pa if and only if pa is the only arena in usable_arenas with that many free pools.

So, in the example above, nfp2lasta[23] has to be set to NULL if and only if the arena in question was the only one with 23 free pools (which can be determined cheaply via invariant #2). And nfp2lasta[24] has to be set to point to the arena in question if and only if nfp2lasta[24] is NULL.

Tricky bit: if the arena in question is the only one with a given original free pool count, no pointers in arena objects need to be changed at all. Following the sketch above literally would leave you trying to insert it after itself, which wouldn't end well ;-)

Anyway, this "feels like a right solution" to me, eliminating all searches with simple (albeit delicate) constant-time code, and requiring code changes only where an arena's number of free pools can change.
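[A rough pure-Python rendering of the scheme, with invented names - the real thing manipulates C structs, and insertion of newly-usable arenas and removal of wholly-free ones are omitted here. It maintains the finger for the old count by looking at the previous arena, the detail the follow-up message below refines:]

```
import random

class Arena:
    def __init__(self, nfreepools):
        self.nfreepools = nfreepools
        self.prev = self.next = None

class UsableArenas:
    def __init__(self):
        self.head = None
        self.nfp2lasta = {}                 # free-pool count -> last arena

    def _unlink(self, a):
        if a.prev is None:
            self.head = a.next
        else:
            a.prev.next = a.next
        if a.next is not None:
            a.next.prev = a.prev

    def _link_after(self, a, anchor):
        a.prev, a.next = anchor, anchor.next
        anchor.next = a
        if a.next is not None:
            a.next.prev = a

    def pool_freed(self, a):
        """Arena `a` just gained one free pool; restore sorted order in O(1)."""
        old = a.nfreepools
        # The finger for the old count must stop pointing at `a`.
        if self.nfp2lasta.get(old) is a:
            if a.prev is not None and a.prev.nfreepools == old:
                self.nfp2lasta[old] = a.prev
            else:
                del self.nfp2lasta[old]     # `a` was the only arena with `old`
        a.nfreepools = old + 1
        anchor = self.nfp2lasta.get(old)    # last arena still holding `old`
        if anchor is not None and anchor is not a.prev:
            self._unlink(a)                 # relink right after that arena
            self._link_after(a, anchor)
        # Record `a` as the last with the new count only if there was none.
        if self.nfp2lasta.get(a.nfreepools) is None:
            self.nfp2lasta[a.nfreepools] = a

if __name__ == "__main__":
    ua, arenas, prev = UsableArenas(), [Arena(0) for _ in range(100)], None
    for a in arenas:                        # build the list in order; all 0s
        a.prev = prev
        if prev is None:
            ua.head = a
        else:
            prev.next = a
        prev = a
    ua.nfp2lasta[0] = arenas[-1]            # last arena with 0 free pools
    for _ in range(1000):
        ua.pool_freed(random.choice(arenas))
    node, last = ua.head, -1                # verify sorted order survived
    while node is not None:
        assert node.nfreepools >= last
        last, node = node.nfreepools, node.next
    print("sorted order maintained")
```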

[Tim]
Ack! Scratch that. I need a nap :-( In fact if that equality holds, it means that nfp2lasta entry has to change if pa is moved and pa->prevarena has the same count of free pools. So forget about the explanation and just think about the goal - the details will work themselves out ;-)

On Thu, May 23, 2019 at 5:15 PM Steve Dower <steve.dower@python.org> wrote:
FYI, and I've mentioned this at PyCon to a few people (might've been you, Steve, I don't remember) -- but at Google we've experimented with disabling obmalloc when using TCMalloc (a faster and thread-aware malloc, which makes a huge difference within Google when dealing with multi-threaded C++ libraries), both using the usual Python benchmarks and real-world code with real-world workloads (a core part of YouTube, for example), all on Linux. There's still a significant benefit to using obmalloc when using glibc's malloc, and also a noticeable benefit when using TCMalloc. There are certainly cases where it doesn't matter much, and there may even be cases where the overhead of obmalloc isn't worth it, but I do believe it's still a firm net benefit.
-- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!

On Fri, 24 May 2019 14:23:21 +0200 Thomas Wouters <thomas@python.org> wrote:
Interesting that a 20-year simple allocator (obmalloc) is able to do better than the sophisticated TCMalloc. (well, of course, obmalloc doesn't have to worry about concurrent scenarios, which explains some of the simplicity) Regards Antoine.

[Antoine Pitrou, replying to Thomas Wouters]
Interesting that a 20-year simple allocator (obmalloc) is able to do better than the sophisticated TCMalloc.
It's very hard to beat obmalloc (O) at what it does. TCMalloc (T) is actually very similar where they overlap, but has to be more complex because it's trying to do more than O. In either case, for small objects "the fast path" consists merely of taking the first block of memory off a singly-linked size-segregated free list. For freeing, the fast path is just linking the block back to the front of the appropriate free list. What _could_ be faster? A "bump allocator" allocates faster (just increment a highwater mark), but creates worlds of problems when freeing. But because O is only trying to deal with small (<= 512 bytes) requests, it can use a very fast method based on trivial address arithmetic to find the size of an allocated block by just reading it up from the start of the (4K) "pool" the address belongs to. T can't do that - it appears to need to look up the address in a more elaborate radix tree, to find info recording the size of the block (which may be just about anything - no upper limit).
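[For concreteness, a toy Python rendering of that "trivial address arithmetic" - constants and layout simplified, not obmalloc's real header:]

```
# Because a pool occupies one aligned 4 KiB page, masking the low bits of any
# small-object address yields the pool header, which records the size class.
POOL_SIZE = 4 * 1024          # current obmalloc pool == one 4 KiB page
ALIGNMENT = 16                # 3.8+ alignment; size classes are multiples of it

def pool_base(addr):
    """Start of the pool containing `addr` (where the header lives)."""
    return addr & ~(POOL_SIZE - 1)

def size_class_index(nbytes):
    """Index of the free list serving a request of `nbytes` (1..512)."""
    return (nbytes - 1) // ALIGNMENT

assert pool_base(0x7f31_2345_6789) % POOL_SIZE == 0
assert size_class_index(1) == 0 and size_class_index(512) == 31
```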
(well, of course, obmalloc doesn't have to worry about concurrent scenarios, which explains some of the simplicity)
Right, T has a different collection of free lists for each thread, so on each entry it has to figure out which collection to use (and so doesn't need to lock). That's not free. O only has one collection, and relies on the GIL.

Against that, O burns cycles worrying about something else: because it was controversial when it was new, O thought it was necessary to handle free/realloc calls even when passed addresses that had actually been obtained from the system malloc/realloc. The T docs I saw said "don't do that - things will blow up in mysterious ways". That's where O's excruciating "address_in_range()" logic comes from. While that's zippy and scales extremely well (it doesn't depend on how many objects/arenas/pools exist), it's not free, and is a significant part of the "fast path" expense for both allocation and deallocation.

It also limits us to a maximum pool size of 4K (to avoid possible segfaults when reading up memory that was actually obtained from the system malloc/realloc), and that's become increasingly painful: on 64-bit boxes the bytes lost to pool headers increased, and O changed to handle requests up to 512 bytes instead of its original limit of 256. O was intended to supply "a bunch" of usable blocks per pool, not just a handful. We "should" really at least double the pool and arena sizes now.

I don't think we need to cater anymore to careless code that mixes system memory calls with O calls (e.g., if an extension gets memory via `malloc()`, it's its responsibility to call `free()`), and if not then `address_in_range()` isn't really necessary anymore either, and then we could increase the pool size. O would, however, need a new way to recognize when its version of malloc punted to the system malloc.

BTW, one more: last I saw T never returns memory to "the system", but O does - indeed, the parent thread here was all about _enormous_ time waste due to that in O ;-) That's not free either, but doesn't affect O's fast paths.
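[A loose Python sketch of the address_in_range() idea, with invented field names: the pool header stores an index into the global arena table, and even a garbage index written by someone else's malloc can't be mistaken for a hit, because the check verifies that the indexed arena really covers the address.]

```
ARENA_SIZE = 256 * 1024

def address_in_range(addr, pool_arenaindex, arenas):
    """arenas: list of arena base addresses (None for unused slots)."""
    if pool_arenaindex >= len(arenas):
        return False
    base = arenas[pool_arenaindex]
    return base is not None and base <= addr < base + ARENA_SIZE
```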

On Sun, 2 Jun 2019 00:56:52 -0500 Tim Peters <tim.peters@gmail.com> wrote:
The interesting thing here is that in many situations, the size is known up front when deallocating - it is simply not communicated to the deallocator because the traditional free() API takes a sole pointer, not a size. But CPython could communicate that size easily if we would like to change the deallocation API. Then there's no bother looking up the allocated size in sophisticated lookup structures. I'll note that jemalloc provides such APIs: http://jemalloc.net/jemalloc.3.html """The dallocx() function causes the memory referenced by ptr to be made available for future allocations. The sdallocx() function is an extension of dallocx() with a size parameter to allow the caller to pass in the allocation size as an optimization.""" Regards Antoine.

[Antoine Pitrou <solipsis@pitrou.net>]
That could work (to make it possible to increase obmalloc's pool size). Except ...
obmalloc doesn't intend to be a general-purpose allocator - it only aims at optimizing "small" allocations, punting to the system for everything beyond that. Unless the size is _always_ passed in (on every free() and realloc() spelling it supports), an "optimization" doesn't help it much. It needs a bulletproof way to determine whether it, or system malloc/realloc, originally obtained an address passed in. If the size is always passed in, no problem (indeed, a single bit could suffice).

But if it's ever possible that the size won't be passed in, all the runtime machinery to figure that out on its own needs to be able to handle all addresses. Like now: if the size were passed in, obmalloc could test the size instead of doing the `address_in_range()` dance(*). But if it's ever possible that the size won't be passed in, all the machinery supporting `address_in_range()` still needs to be there, and every obmalloc spelling of malloc/realloc needs to ensure that machinery will work if the returned address is passed back to an obmalloc free/realloc spelling without the size.

The "only" problem with address_in_range is that it limits us to a maximum pool size of 4K. Just for fun, I boosted that to 8K to see how likely segfaults really are, and a Python built that way couldn't even get to its first prompt before dying with an access violation (Windows-speak for segfault). Alas, that makes it hard to guess how much value there would be for Python programs if the pool size could be increased - can't even get Python started.

We could eliminate the pool size restriction in many ways. For example, we could store the addresses obtained from the system malloc/realloc - but not yet freed - in a set, perhaps implemented as a radix tree to cut the memory burden. But digging through 3 or 4 levels of a radix tree to determine membership is probably significantly slower than address_in_range. I can think of a way to do it slightly faster than (but related to) address_in_range, but it would (on a 64-bit box) require adding 24 more bytes for each system-malloc/realloc allocation. 8 of those bytes would be pure waste, due to the Alignment Gods appearing to require 16-byte alignment for every allocation on a 64-bit box now. In stark contrast, the extra memory burden of the current address_in_range is an insignificant 8 bytes per _arena_ (256 KB, and "should be" larger now).

Another approach: keep address_in_range as-is, but add new internal structure to larger pools, to repeat the arena index every 4KB. But that fights somewhat against the goal of larger pools. Etc. ;-)

(*) Actually not quite true. If a "large" object is obtained from obmalloc now (meaning it actually came from the system malloc), then cut back to a "small" size by a realloc, it _remains_ under the control of the system malloc now. Passing in the newer "small" size to a free() later would cause obmalloc to get fatally confused about that.

On Thu, 6 Jun 2019 13:57:37 -0500 Tim Peters <tim.peters@gmail.com> wrote:
But my response was under the assumption that we would want obmalloc to deal with all allocations. Which is more or less required anyway to have an efficient GC that doesn't have to walk linked lists and access memory in random order to iterate over known objects. Regards Antoine.

[Antoine Pitrou <solipsis@pitrou.net>]
But my response was under the assumption that we would want obmalloc to deal with all allocations.
I didn't know that. I personally have no interest in that: if we want an all-purpose allocator, there are several already to choose from. There's no reason to imagine we could write a better one.
As the parent thread showed, obmalloc does at least as well as any general-purpose allocator known _for Python's purposes_ (a great many (de)allocations of "small" objects). Already explained too that size-segregated linked free lists are the core of _why_ it does so well. Besides making carving off, and freeing, blocks dirt cheap, linked lists also naturally support recycling memory blocks in MRU ("freshness in cache") order.

But I don't know what you mean by "access memory in random order to iterate over known objects". obmalloc never needs to iterate over known objects - indeed, it contains no code capable of doing that. Our cyclic gc does, but that's independent of obmalloc. Over on Discourse, Neil is speculating about using radix trees for cyclic gc instead of _its_ linked lists.

In obmalloc, allocated memory regions aren't linked at all. It's free regions that are linked, and helpfully so in MRU order.

On Thu, 6 Jun 2019 16:03:03 -0500 Tim Peters <tim.peters@gmail.com> wrote:
It's not. Cyclic GC needs its own linked lists *because* the allocator doesn't allow it to iterate over allocated objects. Regards Antoine.

[Tim]
[Antoine]
It's not. Cyclic GC needs its own linked lists *because* the allocator doesn't allow it to iterate over allocated objects.
The doubly linked lists in gc primarily support efficient _partitioning_ of objects for gc's purposes (a union of disjoint sets, with constant-time moving of an object from one set to another, and constant-time union of disjoint sets). "All objects" is almost never interesting to it (it is only when the oldest non-frozen generation is being collected). Between collections, the partitioning is by generation. During a collection, the youngest generations are combined into one, and then that's sub-partitioned in various ways as collection goes along, ending with a partition into reachable and unreachable objects. In between, ephemeral partitions are constructed (e.g., the set of "tentatively unreachable" objects). None of that was driven by obmalloc's (or malloc's) inability to iterate over objects. Doubly linked lists were the obvious way to implement the required operations on partitions efficiently and simply. In any case, it appears to be "a feature" now that people can use any flavor of the malloc family they like in CPython, so I expect that any attempt to tie cyclic gc to a specific flavor of malloc would be a very hard sell. Which, BTW, was the intended meaning of "independent": cyclic gc right now couldn't care less which version of malloc a user plugs in - nor could obmalloc care less which cyclic gc algorithm is used.
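[A minimal Python sketch of that kind of partitioning (invented names; the real gc embeds the links in each object's C header): each partition is a circular doubly linked list with a sentinel, so moving one object between partitions and merging whole partitions are both constant time.]

```
class Node:
    def __init__(self):
        self.prev = self.next = self

class Partition:
    def __init__(self):
        self.head = Node()                  # sentinel; the list is circular

    def add(self, node):
        node.prev, node.next = self.head.prev, self.head
        node.prev.next = node
        self.head.prev = node

    @staticmethod
    def move(node, dest):
        """O(1): unlink from wherever it is, append to `dest`."""
        node.prev.next, node.next.prev = node.next, node.prev
        dest.add(node)

    def merge_into(self, dest):
        """O(1): splice all of self's nodes onto the end of `dest`."""
        if self.head.next is self.head:     # empty
            return
        first, last = self.head.next, self.head.prev
        first.prev, last.next = dest.head.prev, dest.head
        dest.head.prev.next = first
        dest.head.prev = last
        self.head.prev = self.head.next = self.head
```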

On Thu, 6 Jun 2019 17:26:17 -0500 Tim Peters <tim.peters@gmail.com> wrote:
Right. But the oldest generation is precisely the pain point, since full collections can induce very long pauses. IOW, perhaps it would be fine to keep dedicated lists for the young generations, while doing a heap walk for the full collection case.
Depends on the benefits of course ;-) Most people use some pre-built binaries that "naturally" come with obmalloc enabled. Of course, it's only a very small minority of applications that hit real performance issues with the GC - and in probably many of those cases tuning the full collection threshold can alleviate the problem still. Regards Antoine.

On 2019-06-06, Tim Peters wrote:
My current idea is to put partitioning flags on the interior radix tree nodes. If you mark an object as "finalizer reachable", for example, it would mark all the nodes on the path from the root with that flag. Then, when you want to iterate over all the GC objects with a flag, you can avoid uninteresting branches of the tree. For generations, maybe tracking them at the pool level is good enough. Interior nodes can track generations too (i.e. the youngest generation contained under them). My gut feeling is that the prev/next pointer updates done by move_unreachable() and similar functions must be quite expensive. Doing the traversal with an explicit stack is a lot less elegant but I think it should be faster. At least, when you are dealing with a big set of GC objects that don't fit in the CPU cache. Regards, Neil

On 2019-06-06, Tim Peters wrote:
We can almost make it work for GC objects; the use of obmalloc is quite well encapsulated. I think I intentionally designed the PyObject_GC_New/PyObject_GC_Del/etc APIs that way. Quick and dirty experiment is here: https://github.com/nascheme/cpython/tree/gc_malloc_free_size

The major hitch seems to be my new gc_obj_size() function. We can't be sure the 'nbytes' passed to _PyObject_GC_Malloc() is the same as what is computed by gc_obj_size(). It usually works but there are exceptions (freelists for frame objects and tuple objects, for one). A nasty problem is the weirdness with PyType_GenericAlloc() and the sentinel item. _PyObject_GC_NewVar() doesn't include space for the sentinel but PyType_GenericAlloc() does. When you get to gc_obj_size(), you don't know if you should use "nitems" or "nitems+1". I'm not sure how to fix the sentinel issue. Maybe a new type slot or a type flag?

In any case, making a change like my git branch above would almost certainly break extensions that don't play nicely. It won't be hard to make it a build option, like the original gcmodule was. Then, assuming there is a performance boost, people can enable it if their extensions are friendly.
If we can make the above idea work, you could set the pool size to 8K without issue. A possible problem is that the obmalloc and gcmalloc arenas are separate. I suppose that affects performance testing.
You are likely correct. I'm hoping to benchmark the radix tree idea. I'm not too far from having it working such that it can replace address_in_range(). Maybe allocating gc_refs as a block would offset the radix tree cost vs address_in_range(). If the above idea works, we know the object size at free() and realloc(), we don't need address_in_range() for those code paths. Regards, Neil

To be clearer, while knowing the size of allocated objects may be of some use to some other allocators, "not really" for obmalloc. That one does small objects by itself in a uniform way, and punts everything else to the system malloc family. The _only_ thing it wants to know on a free/realloc is which of those two ways delivered the address to begin with. Knowing the original size would be an indirect way to accomplish that, but it really only needs 1 bit of info. If you're using a radix tree to implement - in effect - a dict mapping addresses to "stuff", 1 flag bit in the "stuff" would be ideal for that. If you haven't already, I assume you'll soon come around to believe you really should be tracking the addresses of all gc'ed objects (not just the "small" ones).
If the above idea works, we know the object size at free() and realloc(), we don't need address_in_range() for those code paths.
Two things: first, a single bit would not only be sufficient, it would be _better_ than knowing the object size. If the bit says the object came from the system, the size is of no use at all. If the bit says it came from obmalloc, what's _really_ wanted is the index of the object's "size class" in the vector of free lists, and that's already directly available in the object's pool header (various parts of which have to be read up &/or updated regardless). Second, if that bit were available, `address_in_range()` could be thrown away - obmalloc's free and realloc entries are the only places it's used (or is of any conceivable use to obmalloc).

For the current obmalloc, I have in mind a different way (briefly: let a pool span a number of 4K pages, but teach pools about the page structure so that address_in_range() continues to work after trivial changes (read the arena index from the containing page's start rather than from the containing pool's)). Not ideal, but looks remarkably easy to do, and captures the important part (more objects in a pool -> more times obmalloc can remain in its fastest "all within the pool" paths).
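[The "trivial changes" amount to masking the address down to the page rather than to the (now larger) pool when fetching the arena index - roughly, as a Python sketch with invented layout:]

```
# With 4 KiB pools, page base == pool base; with larger pools, every
# constituent page carries a copy of the arena index, so address_in_range()
# keeps working unchanged apart from the mask.
PAGE_SIZE = 4 * 1024
POOL_SIZE = 16 * 1024          # e.g. a pool spanning four pages

def pool_base(addr):           # where the pool header lives
    return addr & ~(POOL_SIZE - 1)

def page_base(addr):           # where the arena index would now be read from
    return addr & ~(PAGE_SIZE - 1)
```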
At the start, it was the _reachable_ objects that were moved out of the collected generation list rather than the unreachable objects. Since it's (in all my experience) almost all objects that survive almost all collections in almost all programs, that was an enormous amount of moving overall. So long ago I rewrote that code to move the _un_reachable objects instead. Debug output showed countless billions of pointer updates saved across the programs I tried at the time, but the timing difference was in the noise.

Probably a big part of "the problem": when collecting the oldest generation in a program with "a lot" of objects, no approach at all will "fit in cache". We have to crawl the entire object graph then. Note that the test case in the parent thread had over a billion class instances, and it was way whittled down(!) from the OP's real program. But I don't _think_ that's typical quite yet ;-), and:
I certainly agree gc_refs is the big-bang-for-the-buck thing to target. There's no way to avoid mutating that bunches and bunches of times, even in small programs, and reducing the number of dirtied cache lines due to that "should" pay off.

[Tim]
And now there's a PR that removes obmalloc's limit on pool sizes, and, for a start, quadruples pool (and arena!) sizes on 64-bit boxes: https://github.com/python/cpython/pull/13934 https://bugs.python.org/issue37211 As the PR says, """ It would be great to get feedback from 64-bit apps that do massive amounts of small-object allocations and deallocations. """ :-)

On 2019-06-09, Tim Peters wrote:
And now there's a PR that removes obmalloc's limit on pool sizes, and, for a start, quadruples pool (and arena!) sizes on 64-bit boxes:
Neat.
I've done a little testing of the pool overhead. I have an application that uses many small dicts as holders of data. The function sys._debugmallocstats() is useful to get stats for the obmalloc pools. Total data allocated by obmalloc is 262 MB. At the 4*PAGE_SIZE pool size, the wasted space due to partly filled pools is only 0.18%. For 16*PAGE_SIZE pools, 0.71%.

I have a set of stats for another program. In that case, total memory allocated by obmalloc is 14 MB. For 4*PAGE_SIZE pools, wasted space is 0.78% of total. At 16*PAGE_SIZE, it is 2.4%.

Based on that small set of data, using 4*PAGE_SIZE seems conservative. As I'm sure you realize, making pools bigger will waste actual memory, not just virtual address space, because you write the arena pointer to each OS page.

I want to do performance profiling using Linux perf. That should show where the hotspot instructions in the obmalloc code are. Maybe that will be useful to you.

Another thought about address_in_range(): some operating systems allow you to allocate memory at specific alignments. Or, you can even allocate a chunk of memory at a fixed memory location if you do the correct magic incantation. I noticed that Go does that. I imagine doing that has a bunch of associated challenges with it. However, if we could control the alignment and memory location of obmalloc arenas, we would not have the segv problem of address_in_range(). It's probably not worth going down that path due to the problems involved.

Regards, Neil

[Neil Schemenauer <nas-python@arctrix.com>]
Definitely a tradeoff here: increasing the number of pages per pool is a pure win for objects of very popular size classes, but a pure loss for objects of unpopular size classes. New comments about POOL_SIZE in the patch briefly hint at that.
Based on that small set of data, using 4*PAGE_SIZE seems conservative.
Which is a nice way of saying "pulled out of Tim's ass" ;-)
Yes, but mostly no! It remains the case that obmalloc neither writes nor reads anything in an arena until a malloc/realloc call actually needs to return a pointer to a never-before accessed page. Writing the arena index is NOT done to pages when the arena is allocated, or even when a new pool is carved off an arena, but lazily, one page at a time, lowest address to highest, as fresh pages actually need to be returned to the caller. So arena pointers aren't actually written more frequently than when using pools of just one page, as is done now (which _still_ writes the arena index into each page). Subtlety: currently, if you pass a nonsense address to address_in_range that happens to be in one of the arena's pools, address_in_range() says "yes". However, after the patch, it may well return "no" if the address is in one of the pool's pages that hasn't yet been returned to a malloc/realloc caller (in which case the arena index hasn't yet been stored at the start of the page). I don't care about that, because it's 100% a logic error to pass an address to free/realloc that wasn't obtained from a previous malloc/realloc call. So it's a change in undefined behavior.
Would be good to know, yes! But may also depend on the app.
I'm unclear on how that could be exploited. It's not addresses that come from arenas that create segv headaches, it's addresses that come from the system's malloc/realloc. If we were, e.g., to pick a maximum amount of address space obmalloc can use in advance, we could live with a single arena of that size, and just check whether an address is within it. All the problems come from the fact that address space may alternate, any number of times, between addresses in obmalloc arenas and addresses from the system malloc's internals.

On Sun, Jun 2, 2019 at 7:57 AM Tim Peters <tim.peters@gmail.com> wrote:
Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime? And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range? Getting rid of address_in_range sounds like a nice idea, and I would love to test how feasible it is -- I can run such a change against a wide selection of code at work, including a lot of third-party extension modules, but I don't see an easy way to do it right now. -- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!

[Tim]
[Thomas Wouters <thomas@python.org>]
Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime?
I think so. See the "Memory Management" section of the Python/C API Reference Manual. It's always been "forbidden" to, e.g., allocate a thing with PyMem_New() but release it with free(). Ditto mixing a PyMem_Raw... allocator with a PyMem... deallocator, or PyObject... one. Etc. A type's tp_dealloc implementation should damn well know which memory family the type's allocator used. However, no actual proposal on the table changes any "fact on the ground" here. They're all as forgiving of slop as the status quo.
And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range?
_The_ most efficient way is the one almost all allocators used long ago: use some "hidden" bits right before the address returned to the user to store info about the block being returned. Like 1 bit to distinguish between "obmalloc took this out of one of its pools" and "obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc doesn't care)". That would be much faster than what we do now. But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain alignment, so space overhead becomes 100% for the smallest objects obmalloc can return :-( Neil Schemenauer takes a different approach in the recent "radix tree arena map for obmalloc" thread here. We exchanged ideas on that until it got to the point that the tree levels only need to trace out prefixes of obmalloc arena addresses. That is, the new space burden of the radix tree appears quite reasonably small. It doesn't appear to be possible to make it faster than the current address_in_range(), but in small-scale testing so far speed appears comparable.
Neil's branch is here: https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

It's effectively a different _implementation_ of the current address_in_range(), one that doesn't ever need to read possibly uninitialized memory, and couldn't care less about the OS page size. For the latter reason, it's by far the clearest way to enable expanding pool size above 4 KiB. My PR also eliminates the pool size limitation: https://github.com/python/cpython/pull/13934 but at the cost of breaking bigger pools up internally into 4K regions so the excruciating current address_in_range black magic still works.

Neil and I are both keen _mostly_ to increase pool and arena sizes. The bigger they are, the more time obmalloc can spend in its fastest code paths. A question we can't answer yet (or possibly ever) is how badly that would hurt Python returning arenas to the system, in long-running apps that go through phases of low and high memory need. I don't run anything like that - not a world I've ever lived in. All my experiments so far say, for programs that are neither horrible nor wonderful in this respect:

1. An arena size of 4 KiB is most effective for that.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.

So the current 256 KiB arenas already suck for such programs. For "horrible" programs, not even tiny 4K arenas help much. For "wonderful" programs, not even 16 MiB arenas hurt arena recycling effectiveness. So if you have real programs keen to "return memory to the system" periodically, it would be terrific to get info about how changing arena size affects their behavior in that respect.

My PR uses 16K pools and 1M arenas, quadrupling the status quo. Because "why not?" ;-) Neil's branch has _generally_, but not always, used 16 MiB arenas. The larger the arenas in his branch, the smaller the radix tree needs to grow.
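[To make the "radix tree keyed by address prefixes" idea above concrete, a toy Python model - bit widths chosen arbitrarily here, arenas assumed to be aligned to their size, which the real branch does not require. Membership is decided by walking a couple of address-prefix levels, so no possibly-uninitialized pool memory is ever read.]

```
ARENA_BITS = 20                       # 1 MiB arenas, assumed arena-aligned
L1_BITS, L2_BITS = 14, 14             # split the remaining high bits in two

def _split(addr):
    prefix = addr >> ARENA_BITS
    return prefix >> L2_BITS, prefix & ((1 << L2_BITS) - 1)

class ArenaMap:
    def __init__(self):
        self.root = {}                # level-1 index -> {level-2 index: True}

    def mark_arena(self, arena_base):
        i1, i2 = _split(arena_base)
        self.root.setdefault(i1, {})[i2] = True

    def address_in_range(self, addr):
        i1, i2 = _split(addr)
        return i2 in self.root.get(i1, ())
```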

On 2019-06-21, Tim Peters wrote:
If you can test vs some real-world programs, that would be great. I was trying to run some tests this afternoon. Testing with Python 3.8+ is a pain because of the PyCode_New and tp_print changes. I've just added two fixes to the head of the obmalloc_radix_tree branch so that you can compile code generated by old versions of Cython. Without those fixes, building 3rd party extensions can be a real pain.
Currently I have it like your big pool branch (16 KB, 1MB).

For those who would like to test with something compatible with Python 3.7.3, I made re-based branches here: https://github.com/nascheme/cpython/tree/obmalloc_radix_v37 https://github.com/nascheme/cpython/tree/obmalloc_big_pools_v37 They should be ABI compatible with Python 3.7.3. So, if you just re-build the "python" executable, you don't have to rebuild anything else. Both those use the same arena/pool sizes and they both have Tim's arena thrashing fix.

On Fri, 21 Jun 2019 17:18:18 -0500 Tim Peters <tim.peters@gmail.com> wrote:
There's a fundamental problem here: you can't be sure that all allocators reserve such space. If some allocator doesn't, it can return a pointer just at the very start of the page, and if obmalloc tries to look at "a few bits before" that address, it could very well page-fault.
One open question is the cache efficiency of the two approaches. Intuitively, address_in_range() will often look at exactly the same cache line (since a significant number of allocation will share the same "page prefix"). Apparently, this benefit may be offset by cache aliasing issues. Cache aliasing can also be mitigated by the fact that CPU caches are most of the time N-way with N > 1 (but N generally remains small, from 2 to 8, for L1 and L2 caches). So I guess the a priori answer is "it's complicated" :-) I must also thank both you and Neil for running these experiments, even though sometimes I may disagree about the conclusions :-) Regards Antoine.

[Thomas]
And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range?
[Tim]
[Antoine]
I snipped some technical but crucial context in my reply to Thomas: this was assuming users are following the documented requirement to never mix memory calls from different families. What you describe certainly could happen in "illegal" code that, e.g., obtained a block from the system malloc, but passed it to obmalloc to free. Which, in reality, works fine today, although the docs forbid it. (I'm not sure, but there _may_ be some mode of running today that actually enforces the doc requirements.)

In the other world, where code is playing by the rules, if an obmalloc malloc/realloc spelling is called, and it needs to punt to a different allocator, no problem: first it boosts the size request so it has room to store "the bit" it needs before the address it actually returns to the client. Then it's "legal" only to free that memory with an obmalloc spelling of free() later - obmalloc reads "the bit", sees "oh - that's not my memory!", and adjusts the pointer accordingly on _its_ call to the spelling of free() corresponding to the memory family obmalloc used to get the memory to begin with.
Neil Schemenauer takes a different approach in the recent "radix tree arena map for obmalloc" thread here. ...
I believe we haven't seen a program yet that used more than one node at the tree's top level :-) But who knows? mmap() and VirtualAlloc() don't appear to make any _guarantees_ that the high-order bits of returned addresses aren't entirely random. In real life so far, they always appear to be zeroes. While x86 has a 64-bit virtual address space, the hardware "only" supports 48 bits of physical address space, and I haven't seen a virtual address yet where any of the top 16 bits are set. AMD requires that the top 16 bits of virtual addresses be copies of bit 2**47. Blah blah blah - for the foreseeable future, the top level of the tree has a very easy job. And Neil keenly observed that the top level of the tree can be _declared_ as being very broad (and so suck up a lot of the leading bits), because it's a file static and is effectively just an address space reservation (at least on Linux) until nodes in it actually get used.
Indeed it is.
I must also thank both you and Neil for running these experiments, even though sometimes I may disagree about the conclusions :-)
Well, there aren't any conclusions yet - just seeing the same things repeatedly. Over the weekend, Neil ran many variations of a longish-running "real job" related to his work, which goes through phases of processing bulk database operations and "trying to" release the space (details are complicated). Arena recycling was essentially non-existent in either branch (my PR or the radix tree). In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but on closer examination they were virtually all of the "useless arena thrashing" flavor. The number of arenas in use was almost always within one of the high-water mark.

But it got much better in the branches if arenas were shrunk to a tiny 8 KiB. Which is just another instance of the "256 KiB arenas are already way too fat for effective arena recycling unless the program is exceptionally friendly in its dynamic memory use patterns" observation. We haven't seen a case yet where 1 MiB arenas do materially worse than 256 KiB ones in this respect.

Speed is generally a wash between the branches, although they consistently appear to be faster (by a little, not a lot) than 3.7.3. The radix tree generally appears to be a little more memory-frugal than my PR (presumably because my need to break "big pools" into 4K chunks, while the tree branch doesn't, buys the tree more space to actually store objects than it costs for the new tree).

We need more "real jobs", and especially ones where arena recycling is actually doing good in 3.7.3 (which can be hard to tell, because the "number of arenas recycled" output in 3.7.3 is vastly inflated by arena-thrashing noise).

[Tim]
It depends a whole lot on the size classes of the most popular objects. A program below to compute it all. For a 64-bit box using 3.8 alignment, and 16 KiB pools:

pages per pool 4
pool size 16,384
alignment 16

The first output block:

size 16
    SQ 1012  1.2%
    PR 1018  0.6%
    RT 1021  0.3%

SQ is the status quo: we have to use four separate 4 KiB pools, and each burns 48 bytes for a pool header.

PR: my PR. There's one pool spanning 4 pages, with 48 bytes for a pool header in the first page, and 16 bytes to store the arena index in each of the other 3 pages.

RT: the radix tree. One 16 KiB block that only "wastes" 48 bytes for the pool header.

The first number on each line is the number of objects we can get from a "big pool" under that scheme. The second number is the % of total pool space that cannot be used for client objects. So, in the above, PR is a substantial improvement over SQ, and RT a less substantial improvement over PR. Regardless of size class, PR never does worse than SQ, and RT never worse than PR.

The most dramatic difference is in the largest size class:

size 512
    SQ   28 12.5%
    PR   28 12.5%
    RT   31  3.1%

RT is a huge win there. And while it's generally true that RT's advantages are more striking in the larger size classes, it doesn't always crush. For example, in the 2nd-largest size class, it doesn't matter at all which scheme is used:

size 496
    SQ   32  3.1%
    PR   32  3.1%
    RT   32  3.1%

However, in the 3rd-largest size class, RT crushes it again:

size 480
    SQ   32  6.2%
    PR   32  6.2%
    RT   34  0.4%

I trust the general principle at work here is too obvious to need explanation ;-)

Anyway, these differences can really add up when there are millions of objects passed out from a size class where RT has an advantage. That's very attractive to me. On the other hand, this _effectively_ makes arenas even larger (they can contain more objects), which apparently makes it even less likely that arenas can eventually be returned to the system. But, on the third hand, I've seen no evidence yet that increasing arena size matters much at all to that after hitting 128 KiB arenas (smaller than what we already use). "Uncooperative" programs essentially recycle nothing regardless, and "happy" programs essentially recycle almost as many arena bytes with 1 MiB arenas as with 8 KiB arenas.

Here's the code:

PAGES_PER_POOL = 4
ALIGNMENT = 16 # change to 8 for < Python 3.8
PAGE_SIZE = 2**12
POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
POOL_HEADER_SIZE = 48

def from_block(size, blocksize, overhead):
    return (blocksize - overhead) // size

def from_first_page(size, *, pagesize=PAGE_SIZE):
    return from_block(size, pagesize, POOL_HEADER_SIZE)

# using multiple 4K one-page pools - status quo
def nobj_4K(size):
    return from_first_page(size) * PAGES_PER_POOL

# using the PR
def nobj_PR(size):
    return (from_first_page(size)
            + from_block(size, PAGE_SIZE, ALIGNMENT) * (PAGES_PER_POOL - 1))

# using the radix tree branch
def nobj_RT(size):
    return from_first_page(size, pagesize=POOL_SIZE)

print("pages per pool", PAGES_PER_POOL)
print(f"pool size {POOL_SIZE:,}")
print("alignment", ALIGNMENT)
for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
    print("size", size)
    for tag, f in (("SQ", nobj_4K),
                   ("PR", nobj_PR),
                   ("RT", nobj_RT),
                  ):
        nobj = f(size)
        waste = POOL_SIZE - nobj * size
        print(f"    {tag} {nobj:4} {waste/POOL_SIZE:5.1%}")

For the record, there's another contender in the allocator competition now: https://github.com/microsoft/mimalloc/ Regards Antoine. On Mon, 24 Jun 2019 00:20:03 -0500 Tim Peters <tim.peters@gmail.com> wrote:

[Antoine Pitrou <solipsis@pitrou.net>]
Thanks! From a quick skim, most of it is addressing things obmalloc doesn't:

1) Efficient thread safety (we rely on the GIL).

2) Directly handling requests of all sizes (we only deal with <= 512 bytes).

3) Funky stuff to help refcounting languages that want to guarantee that freeing an object takes bounded time (the problem is that large objects can contain an unbounded number of other objects that will then also die - so mimalloc has complications to interact with client-supplied functions that parcel out the tree of objects killed by a decref, so the freeing can be done incrementally over time).

I don't think it's been explicitly pointed out before, but #2 is an alternative to my PR and Neil's radix tree. If we insist that programs play by the rules, and obmalloc controls every piece of memory it hands out, then the new definition of address_in_range() becomes

#define address_in_range(p, pool) true

;-) This deserves more thought than I've ever given to it.

Leaving aside all the above, they emphasize this:

"""
Many allocators use a single free list per size class which can lead to bad spatial locality where objects belonging to a single structure can be spread out over the entire heap. ... To improve the spatial locality of allocation, mimalloc use free list sharding where the heap is divided into pages (per size-class) with a free list per page (where pages are usually 64KiB for small objects).
"""

Which is exactly what obmalloc already does, except we call their "pages" "pools", and ours are 16x smaller (<= Python 3.8) or 4x smaller (my PR). They call our "arenas" "segments", and theirs are a minimum of 4 MiB. Their motivations for making these "large" are the same as mine: maximize the time they can stay in the very zippy "fast paths".

Something I'm torn on: within a pool, obmalloc has two ways to find the next free block. The pool has its own block free list (as in mimalloc), but _also_ has a "bump allocator": the pool is carved into blocks one at a time, low address to high, whenever the pool's free list is NULL. This was part of Vladimir's desire to never touch a piece of memory before it's actually needed to satisfy a client request. But it does complicate and slow the code on the second-fastest allocation path (when the pool's free list _is_ NULL). The conditional "OK, the free list is NULL - so is there room for another bump allocation?" is a waste of time after all the pool's blocks have been passed out at least once. The bump allocator can never succeed again then, but we keep testing for it forever.

The alternative is to break the pool into blocks, and link them into a free list, in one gulp when a size class is assigned to a free pool. That mutates all the blocks in the pool (to store the list pointers), even if all but one will never be used. In return, the code for testing whether the bump allocator can find room, and the pool header members supporting the bump allocator, could be thrown out. That's what mimalloc appears to do, and they said it sped things up a bit overall. But I think I'll leave that one for someone else ;-)
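[A condensed Python model of those two paths inside one pool - invented structure, with "addresses" standing in as plain offsets. Fast path: pop the pool's free list. Second path: bump-allocate a never-before-used block, so memory is touched only on demand.]

```
class Pool:
    def __init__(self, size, pool_bytes=16 * 1024, header=48):
        self.size = size                    # block size for this size class
        self.freeblock = []                 # stack standing in for the free list
        self.nextoffset = header            # bump pointer
        self.maxnextoffset = pool_bytes - size

    def malloc(self):
        if self.freeblock:                  # fastest path: reuse a freed block
            return self.freeblock.pop()
        if self.nextoffset <= self.maxnextoffset:   # bump allocation
            block, self.nextoffset = self.nextoffset, self.nextoffset + self.size
            return block
        return None                         # pool is full

    def free(self, block):                  # push back on the pool's free list
        self.freeblock.append(block)
```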

On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
It's a very strong competitor!

$ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
Faster (14):
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
- unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
- json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
- json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
- crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
- regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
- pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
- scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
- float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
- pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
- mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
- scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms: 1.04x faster (-3%)
- xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
- sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x faster (-3%)

Benchmark hidden because not significant (46): ...

On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
Ah, interesting. Were you able to measure the memory footprint as well?
Hmm, it is not good. mimalloc uses MADV_FREE so it may affect some benchmarks. I will look into it later.

```
$ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
Slower (60):
- logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
- logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%)
- regex_dna: 13.3 MB +- 19.1 kB -> 27.0 MB +- 12.1 kB: 2.02x slower (+102%)
- json_dumps: 8351.8 kB +- 19.8 kB -> 15.2 MB +- 18.0 kB: 1.87x slower (+87%)
- regex_v8: 8434.6 kB +- 20.9 kB -> 14.4 MB +- 11.0 kB: 1.75x slower (+75%)
- unpack_sequence: 7521.0 kB +- 17.0 kB -> 9980.8 kB +- 24.7 kB: 1.33x slower (+33%)
- hexiom: 7412.2 kB +- 19.0 kB -> 9307.4 kB +- 8004 bytes: 1.26x slower (+26%)
- xml_etree_process: 12.2 MB +- 26.3 kB -> 15.0 MB +- 28.9 kB: 1.23x slower (+23%)
- genshi_text: 10.2 MB +- 24.0 kB -> 12.5 MB +- 24.8 kB: 1.22x slower (+22%)
- crypto_pyaes: 7602.2 kB +- 35.7 kB -> 9242.8 kB +- 7873 bytes: 1.22x slower (+22%)
- tornado_http: 24.9 MB +- 72.1 kB -> 30.1 MB +- 33.0 kB: 1.21x slower (+21%)
- chameleon: 15.8 MB +- 24.5 kB -> 19.1 MB +- 23.4 kB: 1.21x slower (+21%)
- genshi_xml: 10.9 MB +- 24.0 kB -> 12.9 MB +- 19.6 kB: 1.18x slower (+18%)
- go: 8662.6 kB +- 16.4 kB -> 10082.8 kB +- 26.2 kB: 1.16x slower (+16%)
- pathlib: 8863.6 kB +- 30.2 kB -> 10229.8 kB +- 19.4 kB: 1.15x slower (+15%)
- scimark_fft: 7473.4 kB +- 14.4 kB -> 8606.0 kB +- 28.6 kB: 1.15x slower (+15%)
- scimark_lu: 7463.2 kB +- 15.1 kB -> 8569.8 kB +- 28.6 kB: 1.15x slower (+15%)
- pidigits: 7380.2 kB +- 20.0 kB -> 8436.0 kB +- 24.2 kB: 1.14x slower (+14%)
- scimark_monte_carlo: 7354.4 kB +- 18.2 kB -> 8398.8 kB +- 27.0 kB: 1.14x slower (+14%)
- scimark_sparse_mat_mult: 7889.8 kB +- 20.1 kB -> 9006.2 kB +- 29.4 kB: 1.14x slower (+14%)
- scimark_sor: 7377.2 kB +- 18.9 kB -> 8402.0 kB +- 29.0 kB: 1.14x slower (+14%)
- chaos: 7693.0 kB +- 33.0 kB -> 8747.6 kB +- 10.5 kB: 1.14x slower (+14%)
- richards: 7364.2 kB +- 29.8 kB -> 8331.4 kB +- 20.2 kB: 1.13x slower (+13%)
- raytrace: 7696.0 kB +- 30.3 kB -> 8695.4 kB +- 30.0 kB: 1.13x slower (+13%)
- sqlite_synth: 8799.2 kB +- 25.5 kB -> 9937.4 kB +- 27.1 kB: 1.13x slower (+13%)
- logging_silent: 7533.8 kB +- 32.0 kB -> 8488.2 kB +- 25.1 kB: 1.13x slower (+13%)
- json_loads: 7317.8 kB +- 22.7 kB -> 8215.2 kB +- 21.5 kB: 1.12x slower (+12%)
- unpickle_list: 7513.4 kB +- 9790 bytes -> 8420.6 kB +- 25.6 kB: 1.12x slower (+12%)
- unpickle: 7519.8 kB +- 11.4 kB -> 8425.4 kB +- 27.1 kB: 1.12x slower (+12%)
- fannkuch: 7170.0 kB +- 14.9 kB -> 8033.0 kB +- 22.5 kB: 1.12x slower (+12%)
- pickle_list: 7514.6 kB +- 18.2 kB -> 8414.6 kB +- 24.0 kB: 1.12x slower (+12%)
- telco: 7685.2 kB +- 15.0 kB -> 8598.2 kB +- 17.6 kB: 1.12x slower (+12%)
- nbody: 7214.8 kB +- 10.7 kB -> 8070.2 kB +- 19.5 kB: 1.12x slower (+12%)
- pickle: 7523.2 kB +- 12.4 kB -> 8415.0 kB +- 21.0 kB: 1.12x slower (+12%)
- 2to3: 7171.2 kB +- 35.8 kB -> 8016.4 kB +- 21.7 kB: 1.12x slower (+12%)
- nqueens: 7425.2 kB +- 21.8 kB -> 8296.8 kB +- 25.5 kB: 1.12x slower (+12%)
- spectral_norm: 7212.6 kB +- 19.6 kB -> 8052.8 kB +- 18.4 kB: 1.12x slower (+12%)
- regex_compile: 8538.0 kB +- 21.0 kB -> 9528.6 kB +- 22.1 kB: 1.12x slower (+12%)
- pickle_pure_python: 7559.8 kB +- 19.4 kB -> 8430.0 kB +- 25.6 kB: 1.12x slower (+12%)
- unpickle_pure_python: 7545.4 kB +- 9233 bytes -> 8413.0 kB +- 15.6 kB: 1.11x slower (+11%)
- float: 23.9 MB +- 22.5 kB -> 26.6 MB +- 19.7 kB: 1.11x slower (+11%)
- sqlalchemy_imperative: 18.2 MB +- 46.2 kB -> 20.2 MB +- 36.5 kB: 1.11x slower (+11%)
- regex_effbot: 7910.8 kB +- 15.1 kB -> 8804.8 kB +- 20.9 kB: 1.11x slower (+11%)
- pickle_dict: 7563.4 kB +- 15.3 kB -> 8415.2 kB +- 19.3 kB: 1.11x slower (+11%)
- sqlalchemy_declarative: 18.9 MB +- 40.2 kB -> 21.0 MB +- 26.4 kB: 1.11x slower (+11%)
- xml_etree_parse: 11.8 MB +- 12.6 kB -> 13.0 MB +- 16.3 kB: 1.11x slower (+11%)
- html5lib: 20.1 MB +- 44.9 kB -> 22.2 MB +- 46.6 kB: 1.10x slower (+10%)
- xml_etree_iterparse: 12.0 MB +- 26.5 kB -> 13.2 MB +- 31.3 kB: 1.10x slower (+10%)
- sympy_integrate: 36.4 MB +- 26.7 kB -> 40.0 MB +- 33.2 kB: 1.10x slower (+10%)
- sympy_str: 37.2 MB +- 28.4 kB -> 40.7 MB +- 26.6 kB: 1.10x slower (+10%)
- sympy_expand: 36.2 MB +- 19.9 kB -> 39.7 MB +- 25.1 kB: 1.09x slower (+9%)
- mako: 15.3 MB +- 19.1 kB -> 16.7 MB +- 25.4 kB: 1.09x slower (+9%)
- django_template: 19.3 MB +- 14.9 kB -> 21.0 MB +- 14.6 kB: 1.09x slower (+9%)
- xml_etree_generate: 12.3 MB +- 39.5 kB -> 13.3 MB +- 26.9 kB: 1.08x slower (+8%)
- deltablue: 8918.0 kB +- 19.8 kB -> 9615.8 kB +- 12.5 kB: 1.08x slower (+8%)
- dulwich_log: 12.2 MB +- 102.9 kB -> 13.1 MB +- 26.6 kB: 1.08x slower (+8%)
- meteor_contest: 9296.0 kB +- 11.9 kB -> 9996.8 kB +- 20.7 kB: 1.08x slower (+8%)
- sympy_sum: 62.2 MB +- 20.8 kB -> 66.5 MB +- 21.1 kB: 1.07x slower (+7%)
- python_startup: 7946.6 kB +- 20.4 kB -> 8210.2 kB +- 16.6 kB: 1.03x slower (+3%)
- python_startup_no_site: 7409.0 kB +- 18.3 kB -> 7574.6 kB +- 21.8 kB: 1.02x slower (+2%)
```

-- Inada Naoki <songofacandy@gmail.com>

[Antoine Pitrou <solipsis@pitrou.net>]
Ah, interesting. Were you able to measure the memory footprint as well?
[Inada Naoki <songofacandy@gmail.com>]
Could you say more about what's being measured here? Like, if this is on Linux, is it reporting RSS? VSS? Something else?

mimalloc is "most like" obmalloc among those I've looked at in recent weeks. I noted before that its pools (they call them "pages") and arenas (called "segments") are at least 16x larger than obmalloc uses (64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size, in mimalloc). Like all "modern" 64-bit allocators, it cares little about reserving largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom. But it also releases (in some sense, like MADV_FREE) physical RAM back to the system at a granularity far smaller than arena/segment.

At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB vector at the start, so no program linking to that can consume less than half a gig of address space. But it only expects to _use_ a few 4 KiB OS pages out of that. mimalloc doesn't do anything anywhere near _that_ gonzo (& since mimalloc comes out of Microsoft, it never will - "total virtual memory" on Windows is a system-wide resource, limited to the sum of physical RAM + pagefile size - no "overcommit" is allowed).

I guess that INADA-san used pyperformance --track-memory.

pyperf --track-memory doc:

"--track-memory: get the memory peak usage. it is less accurate than tracemalloc, but has a lower overhead. On Linux, compute the sum of Private_Clean and Private_Dirty memory mappings of /proc/self/smaps. On Windows, get PeakPagefileUsage of GetProcessMemoryInfo() (of the current process): the peak value of the Commit Charge during the lifetime of this process."
https://pyperf.readthedocs.io/en/latest/runner.html#misc

On Linux, pyperf uses read_smap_file() of pyperf._memory:
https://github.com/vstinner/pyperf/blob/master/pyperf/_memory.py

# Code to parse Linux /proc/%d/smaps files.
#
# See http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html for
# a quick introduction to smaps.
#
# Need Linux 2.6.16 or newer.
def read_smap_file():
    total = 0
    fp = open(proc_path("self/smaps"), "rb")
    with fp:
        for line in fp:
            # Include both Private_Clean and Private_Dirty sections.
            line = line.rstrip()
            if line.startswith(b"Private_") and line.endswith(b'kB'):
                parts = line.split()
                total += int(parts[1]) * 1024
    return total

It spawns a thread which reads /proc/self/smaps every millisecond and then reports the *maximum*.

Victor

On Thu, Jul 4, 2019 at 17:12, Tim Peters <tim.peters@gmail.com> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

[Victor Stinner <vstinner@redhat.com>]
So I'll take that as meaning essentially that it's reporting what RSS would report if it ignored shared pages (so peak # of private pages actually backed by physical RAM). Clear as mud how MADV_FREE affects that.

I found the calibrated loop count is not stable, so memory usage is very different in some benchmarks. In particular, the RAM usage of the logging benchmark depends heavily on the loop count:

    $ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so ./python bm_logging.py simple --track-memory --fast --inherit-environ PYTHONMALLOC,LD_PRELOAD -v
    Run 1: calibrate the number of loops: 512
    - calibrate 1: 12.7 MB (loops: 512)
    Calibration: 1 warmup, 512 loops
    Run 2: 0 warmups, 1 value, 512 loops
    - value 1: 12.9 MB
    Run 3: 0 warmups, 1 value, 512 loops
    - value 1: 12.9 MB
    ...

    $ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so ./python bm_logging.py simple --track-memory --fast --inherit-environ PYTHONMALLOC,LD_PRELOAD -v -l1024
    Run 1: 0 warmups, 1 value, 1024 loops
    - value 1: 21.4 MB
    Run 2: 0 warmups, 1 value, 1024 loops
    - value 1: 21.4 MB
    Run 3: 0 warmups, 1 value, 1024 loops
    - value 1: 21.4 MB
    ...

-- Inada Naoki <songofacandy@gmail.com>

On Thu, Jul 4, 2019 at 11:32 PM Inada Naoki <songofacandy@gmail.com> wrote:
I think I understand why mimalloc uses more than twice the memory of pymalloc + glibc malloc in the logging_format and logging_simple benchmarks. These two benchmarks do roughly this:

    buf = []  # in StringIO
    for _ in range(10*1024):
        buf.append("important: some important information to be logged")
    s = "".join(buf)  # StringIO.getvalue()
    s.splitlines()

mimalloc uses a size-segregated allocator for ~512KiB, and the size class is determined by the top three bits. On the other hand, a list grows its capacity by about 9/8, so the next size class is used on each realloc. In the end, every size class has 1~3 used/cached memory blocks. This is almost the worst case for mimalloc. In a more complex application, there may be more chances to reuse memory blocks. In a complex or huge application, this overhead will become relatively small. Its speed is attractive. But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for common cases. Regards, -- Inada Naoki <songofacandy@gmail.com>

[Inada Naoki <songofacandy@gmail.com>, trying mimalloc]
Hmm, it is not good. mimalloc uses MADV_FREE so it may affect some benchmarks. I will look into it later.
Often, but not always (multiplication by 9/8 may not change the top 3 bits - e.g., 128 * 9/8 = 144).
In the end, every size class has 1~3 used/cached memory blocks.
No doubt part of it, but hard to believe it's most of it. If the loop count above really is 10240, then there's only about 80K worth of pointers in the final `buf`. To account for a difference of over 10M, it would need to have left behind well over 100 _full_ size copies from earlier reallocs. In fact, the number of list elements across resizes goes like so:

    0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those sums to 96,113, so it accounts for less than 1M of 8-byte pointers if none were ever released. mimalloc will, of course, add its own slop on top of that - but not a factor of ten's worth. Unless maybe it's using a dozen such buffers at once? But does it really matter? ;-) mimalloc "should have" done MADV_FREE on the pages holding the older `buf` instances, so it's not like the app is demanding to hold on to the RAM (albeit that it may well show up in the app's RSS unless/until the OS takes the RAM away).
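That arithmetic is easy to check mechanically. The sketch below simulates the 10240 appends under the assumption that CPython 3.7's list over-allocation rule is new_allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6); it is an illustration, not CPython code:

```
#include <stdio.h>
#include <stddef.h>

/* Simulate 10240 appends to an initially empty list and record the
 * capacity chosen at each resize, using the 3.7-era growth rule.
 * The printed sequence should match the one quoted above
 * (4, 8, 16, 25, ..., 10945), and the total is the number of pointer
 * slots ever allocated across all resizes. */
int main(void)
{
    size_t allocated = 0;   /* current capacity */
    size_t total = 0;       /* sum of capacities over all resizes */

    for (size_t newsize = 1; newsize <= 10 * 1024; newsize++) {
        if (newsize > allocated) {  /* list_resize() would trigger here */
            allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
            total += allocated;
            printf("%zu ", allocated);
        }
    }
    printf("\ntotal pointer slots across resizes: %zu (~%zu KiB of pointers)\n",
           total, total * sizeof(void *) / 1024);
    return 0;
}
```

On a 64-bit box that works out to well under 1 MiB of pointers, consistent with the argument above.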
The mimalloc page says that, in their benchmarks: """ In our benchmarks (see below), mimalloc always outperforms all other leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses less memory (up to 25% more in the worst case). """ obmalloc is certainly more "memory efficient" (for some meanings of that phrase) for smaller objects: in 3.7 it splits objects of <= 512 bytes into 64 size classes. mimalloc also has (close to) 64 "not gigantic" size classes, but those cover a range of sizes over a thousand times wider (up to about half a meg). Everything obmalloc handles fits in mimalloc's first 20 size classes. So mimalloc routinely needs more memory to satisfy a "small object" request than obmalloc does. I was more intrigued by your first (speed) comparison:
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
Now _that's_ interesting ;-) Looks like spectral_norm recycles many short-lived Python floats at a swift pace. So memory management should account for a large part of its runtime (the arithmetic it does is cheap in comparison), and obmalloc and mimalloc should both excel at recycling mountains of small objects. Why is mimalloc significantly faster? This benchmark should stay in the "fastest paths" of both allocators most often, and they both have very lean fastest paths (they both use pool-local singly-linked sized-segregated free lists, so malloc and free for both should usually amount to just popping or pushing one block off/on the head of the appropriate list). obmalloc's `address_in_range()` is definitely a major overhead in its fastest `free()` path, but then mimalloc has to figure out which thread is doing the freeing (looks cheaper than address_in_range, but not free). Perhaps the layers of indirection that have been wrapped around obmalloc over the years are to blame? Perhaps mimalloc's larger (16x) pools and arenas let it stay in its fastest paths more often? I don't know why, but it would be interesting to find out :-)

On Tue, Jul 9, 2019 at 9:46 AM Tim Peters <tim.peters@gmail.com> wrote:
You are right. list.append is not the major memory consumer in the "large" class (8KiB+1 ~ 512KiB). There are several causes of large-size allocations:

* bm_logging uses StringIO.seek(0); StringIO.truncate() to reset the buffer. So from the 2nd loop on, the internal buffer of StringIO becomes a Py_UCS4 array instead of a list of strings. This buffer uses the same capacity-growth policy as list: `size + size >> 8 + (size < 9 ? 3 : 6)`. Actually, when I use the `-n 1` option, memory usage is only 9MiB.
* The intern dict.
* Many modules are loaded, and FileIO.readall() is used to read pyc files. This creates and deletes bytes objects of various sizes.
* The logging module uses several regular expressions. `b'\0' * 0xff00` is used in sre_compile. https://github.com/python/cpython/blob/master/Lib/sre_compile.py#L320
mimalloc doesn't call madvise for each free(). Each size class keeps a 64KiB mimalloc "page", and several OS pages (4KiB) in that "page" are committed but not used. I dumped all "mimalloc page" stats: https://paper.dropbox.com/doc/mimalloc-on-CPython--Agg3g6XhoX77KLLmN43V48cfAg-fFyIm8P9aJpymKQN0scpp#:uid=671467140288877659659079&h2=memory-usage-of-logging_format

For example:

    bin  block_size  used  capacity  reserved
     29        2560     1        22        25   (14 pages are committed, 2560 bytes are in use)
     29        2560    14        25        25   (16 pages are committed, 2560*14 bytes are in use)
     29        2560    11        25        25
     31        3584     1         5        18   (5 pages are committed, 3584 bytes are in use)
     33        5120     1         4        12
     33        5120     2        12        12
     33        5120     2        12        12
     37       10240     3        11       409
     41       20480     1         6       204
     57      327680     1         2        12

* committed pages can be calculated roughly by `ceil(block_size * capacity / 4096)`.

There are dozens of unused memory blocks and committed pages in each size class. This caused 10MiB+ of memory-usage overhead on the logging_format and logging_simple benchmarks.
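The parenthetical page counts in the dump follow directly from that formula; a throwaway check (illustrative only, assuming a 4 KiB OS page) looks like:

```
#include <stdio.h>

/* Recompute "committed pages" for a few of the dumped bins using
 * ceil(block_size * capacity / 4096), as described above. */
int main(void)
{
    struct { unsigned block_size, used, capacity; } bins[] = {
        {2560, 1, 22},   /* expected: 14 pages committed */
        {2560, 14, 25},  /* expected: 16 */
        {3584, 1, 5},    /* expected: 5  */
    };
    for (unsigned i = 0; i < sizeof bins / sizeof bins[0]; i++) {
        unsigned long bytes = (unsigned long)bins[i].block_size * bins[i].capacity;
        unsigned long pages = (bytes + 4095) / 4096;   /* ceiling division */
        unsigned long in_use = (unsigned long)bins[i].block_size * bins[i].used;
        printf("block_size=%u capacity=%u -> %lu pages committed, %lu bytes in use\n",
               bins[i].block_size, bins[i].capacity, pages, in_use);
    }
    return 0;
}
```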
Totally agree. I'll investigate this next. Regards, -- Inada Naoki <songofacandy@gmail.com>

On Tue, Jul 9, 2019 at 5:29 PM Inada Naoki <songofacandy@gmail.com> wrote:
I compared "perf" output of mimalloc and pymalloc, and I succeeded to optimize pymalloc! $ ./python bm_spectral_norm.py --compare-to ./python-master python-master: ..................... 199 ms +- 1 ms python: ..................... 182 ms +- 4 ms Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +- 4 ms: 1.10x faster (-9%) mimalloc uses many small static (inline) functions. On the other hand, pymalloc_alloc and pymalloc_free is large function containing slow/rare path. PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free. But compiler doesn't know which is the hot part in pymalloc_alloc and pymalloc_free. So gcc failed to chose code to inline. Remaining part of pymalloc_alloc and pymalloc_free are called and many push/pop are executed because they contains complex logic. So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part. But I need to use "static inline" for pymalloc_alloc and pymalloc_free yet [1]. Generated assembly is optimized well, the hot code is in top of the PyObject_Malloc [2] and PyObject_Free [3]. But there are many code duplication in PyObject_Malloc and PyObject_Calloc, etc... [1] https://github.com/python/cpython/pull/14674/files [2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmall... [3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmall... I will try to split pymalloc_alloc and pymalloc_free to smaller functions. Except above, there is one more important difference. pymalloc return free pool to freepool list soon when pool become empty. On the other hand, mimalloc return "page" (it's similar to "pool" in pymalloc) not everytime when it's empty [4]. So they can avoid rebuilding linked list of free blocks. I think pymalloc should do same optimization. [4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e... BTW, which is proper name? pymalloc, or obmalloc. Regards, -- Inada Naoki <songofacandy@gmail.com>

[Inada Naoki <songofacandy@gmail.com>, looking into why mimalloc did so much better on spectral_norm]
Yay!! Nice :-)
mimalloc uses many small static (inline) functions.
Too many, if you ask me ;-) Really, the enormous number of tiny functions makes the code very hard to follow for me. OTOH, the tiny number of enormous functions in pymalloc makes that hard to follow too :-(
On the other hand, pymalloc_alloc and pymalloc_free are large functions containing slow/rare paths.
obmalloc.c was written when compiler inlining was barely a thing. Some restructuring is a fine idea - overdue, in fact.
Yes, splitting out the slower paths should be a win. Note this is one reason I remain keen to increase pool size: the bigger the pools, the more objects can be handed out & reclaimed _from_ the fastest paths. You've now discovered that "the fastest paths" are, for pragmatic compiler-is-overwhelmed reasons, significantly slower than they "should be".
Apart from the above, there is one more important difference.
pymalloc returns a free pool to the freepool list as soon as the pool becomes empty.
Fine by me, & I think it should continue to do so. Details below.
pymalloc never returns anything to the system at smaller than "arena" granularity, so it's not a big deal at all _merely_ to link and unlink a pool to/from an arena's freepool list. There's no possibility that any byte in the pool can change while the pool is sitting on a freepools list. If we freed the last pool of a given size class, and next need to allocate another object of that size class (most common), it's cheap: when the pool is unlinked from the pool free list, it sees that the pool's last size class is the same as the size class being requested, and skips almost all pool initialization (the pool is already fully prepared for objects of this size class).

    init_pool:
        /* Frontlink to used pools. */
        next = usedpools[size + size]; /* == prev */
        pool->nextpool = next;
        pool->prevpool = next;
        next->nextpool = pool;
        next->prevpool = pool;
        pool->nalloc = 1;
        if (pool->szidx == size) {   // xxxxxxxx HERE xxxxxxxx
            /* Luckily, this pool last contained blocks
             * of the same size class, so its header
             * and free list are already initialized.
             */
            bp = pool->freeblock;
            assert(bp != NULL);
            pool->freeblock = *(block **)bp;
            goto success;            // xxxxxxxxxxx FASTEST PATH xxxxxxxxxxxxxx
        }
        // and here there's a mound of code in case
        // the size classes don't match, including code
        // to restart the process of linking the pool's
        // blocks into a free list.

So in the common case, the pool's free list is reused exactly as-is at the time the pool was linked to the freepool list.
I think pymalloc should do the same optimization.
As above, I believe pymalloc is already optimal in this case. On the other end, something to consider: when a block is freed from an entirely used pool, the pool now contains one free block, and is linked to the front of usedpools[sizeclass]. So the next request for a block of that size will come from that pool, which will again cause the pool to become entirely used. That's a kind of possible thrashing, and acts against mimalloc's (sane, to my eyes) desire to _entirely_ use up a given pool ("page") once it starts passing things out from a pool. It could well be better to, e.g., link the pool-with-one-newly-free-block to the end of the usedpool list instead. But it's unclear. The reason it front-links is that the block being freed is presumably in L1 cache (because, e.g., its refcount just fell to 0), so best to hand it out again ASAP. But that old reasoning applied when pymalloc was _only_ used for PyObject-like structs.
[4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e...
BTW, which is the proper name: pymalloc, or obmalloc?
I think it's pymalloc, so that's what I used above, I got lazy before ;-)

On 2019-07-09, Inada Naoki wrote:
I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO enabled. So, I would try that first. Also, if you use relatively small static functions that are defined before use (no forward declarations), I have found that GCC is usually smart about inlining them. So, I don't think you should have to use "static inline" rather than just "static". Good work looking into this. Should be some relatively easy performance win. Cheers, Neil

[Inada Naoki]
[Neil Schemenauer]
I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO enabled.
I like adding those regardless of whether compilers find them helpful: they help _people_ reading the code focus on what's important to speed. While not generally crucial, speed is important in these very low-level, very heavily used functions. Speaking of which, another possible teensy win: pymalloc's allocation has always started with:

    if (nbytes == 0) {
        return 0;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return 0;
    }
    size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;

But it could be a bit leaner:

    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
        return 0;
    }
    size = (uint)fatsize;

The `nbytes == 0` case ends up mapping to a very large size class then, although C may not guarantee that. But it doesn't matter: if it maps to "a real" size class, that's fine. We'll return a unique pointer into a pymalloc pool then, and "unique pointer" is all that's required. An allocation requesting 0 bytes does happen at times, but it's very rare. It just doesn't merit its own dedicated test-&-branch.
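The wraparound is easy to see in isolation. This tiny demo assumes the 3.7-era constants (ALIGNMENT_SHIFT == 3, NB_SMALL_SIZE_CLASSES == 64) and size_t arithmetic; with those assumptions a 0-byte request lands in the "too big" branch, which is one of the acceptable outcomes described above:

```
#include <stdio.h>
#include <stddef.h>

#define ALIGNMENT_SHIFT       3    /* 8-byte size class granularity (3.7-era) */
#define NB_SMALL_SIZE_CLASSES 64   /* 512 / 8 */

int main(void)
{
    size_t requests[] = {0, 1, 8, 512, 513};
    for (size_t i = 0; i < sizeof requests / sizeof requests[0]; i++) {
        size_t nbytes = requests[i];
        size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;  /* wraps for 0 */
        printf("nbytes=%3zu -> fatsize=%20zu -> %s\n", nbytes, fatsize,
               fatsize >= NB_SMALL_SIZE_CLASSES ? "punt" : "pymalloc");
    }
    return 0;
}
```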
Good work looking into this. Should be some relatively easy performance win.
Ditto!

On 2019-07-09, Inada Naoki wrote:
Hello Inada, I don't see this on my PC. I'm using GCC 8.3.0. I have configured the build with --enable-optimizations. To speed up the profile generation, I have changed PROFILE_TASK to only run these tests:

    test_shelve test_set test_pprint test_pickletools test_ordered_dict
    test_tabnanny test_difflib test_pickle test_json test_collections

I haven't spent much time trying to figure out what set of tests is best but the above set runs pretty quickly and seems to work okay. I have run pyperformance to compare CPython 'master' with your PR 14674. There doesn't seem to be a difference (table below). If I look at the disassembly, it seems that the hot paths of pymalloc_alloc and pymalloc_free are being inlined as you would hope, without needing the LIKELY/UNLIKELY annotations. OTOH, your addition of LIKELY() and UNLIKELY() in the PR is a pretty small change and probably doesn't hurt anything. So, I think it would be fine to merge it. Regards, Neil

+-------------------------+---------+-----------------------------+
| Benchmark               | master  | PR-14674                    |
+=========================+=========+=============================+
| 2to3                    | 305 ms  | 304 ms: 1.00x faster (-0%)  |
+-------------------------+---------+-----------------------------+
| chaos                   | 109 ms  | 110 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| crypto_pyaes            | 118 ms  | 117 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| django_template         | 112 ms  | 114 ms: 1.02x slower (+2%)  |
+-------------------------+---------+-----------------------------+
| fannkuch                | 446 ms  | 440 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| float                   | 119 ms  | 120 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| go                      | 247 ms  | 250 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| json_loads              | 25.1 us | 24.4 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| logging_simple          | 8.86 us | 8.66 us: 1.02x faster (-2%) |
+-------------------------+---------+-----------------------------+
| meteor_contest          | 97.5 ms | 97.7 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| nbody                   | 140 ms  | 142 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| pathlib                 | 19.2 ms | 18.9 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle                  | 8.95 us | 9.08 us: 1.02x slower (+2%) |
+-------------------------+---------+-----------------------------+
| pickle_dict             | 18.1 us | 18.0 us: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle_list             | 2.75 us | 2.68 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| pidigits                | 182 ms  | 184 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| python_startup          | 7.83 ms | 7.81 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| python_startup_no_site  | 5.36 ms | 5.36 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| raytrace                | 495 ms  | 499 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| regex_dna               | 173 ms  | 170 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| regex_effbot            | 2.79 ms | 2.67 ms: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+
| regex_v8                | 21.1 ms | 21.2 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| richards                | 68.2 ms | 68.7 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| scimark_monte_carlo     | 103 ms  | 102 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| scimark_sparse_mat_mult | 4.37 ms | 4.35 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| spectral_norm           | 132 ms  | 133 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| sqlalchemy_imperative   | 30.3 ms | 30.7 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| sympy_sum               | 88.2 ms | 89.2 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| telco                   | 6.63 ms | 6.58 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| tornado_http            | 178 ms  | 179 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| unpickle                | 12.0 us | 12.4 us: 1.03x slower (+3%) |
+-------------------------+---------+-----------------------------+
| unpickle_list           | 3.93 us | 3.75 us: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+

Not significant (25): deltablue; dulwich_log; hexiom; html5lib; json_dumps; logging_format; logging_silent; mako; nqueens; pickle_pure_python; regex_compile; scimark_fft; scimark_lu; scimark_sor; sqlalchemy_declarative; sqlite_synth; sympy_expand; sympy_integrate; sympy_str; unpack_sequence; unpickle_pure_python; xml_etree_parse; xml_etree_iterparse; xml_etree_generate; xml_etree_process

On Wed, Jul 10, 2019 at 5:18 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
I didn't use PGO, and that's why GCC didn't know which part is hot. Maybe pymalloc performance is similar to mimalloc when PGO is used, but I have not confirmed it. While Linux distributions are using PGO, some people use non-PGO Python (Homebrew, pyenv, etc...). So better performance without PGO is worthwhile. Regards, -- Inada Naoki <songofacandy@gmail.com>

I did it and pymalloc is now as fast as mimalloc.

    $ ./python bm_spectral_norm.py --compare-to=./python-master
    python-master: ..................... 199 ms +- 1 ms
    python: ..................... 176 ms +- 1 ms
    Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 176 ms +- 1 ms: 1.13x faster (-11%)

I filed a new issue for this: https://bugs.python.org/issue37543

On Fri, Jun 21, 2019 at 11:19 PM Thomas Wouters <thomas@python.org> wrote:
Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime?
The memory allocator must not be changed after Python pre-initialization. What can be done after pre-initialization is more about installing "hooks" which execute code before/after an allocation; it doesn't replace the allocator. It simply doesn't work to switch from pymalloc to malloc "at runtime". Calling PyMem_Free(ptr) would call free(ptr). If the memory block was allocated by pymalloc, free(ptr) would simply crash. Victor
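The "hook" mechanism referred to here is PyMem_GetAllocator()/PyMem_SetAllocator(): you can wrap the current allocator, but the wrapped functions must keep delegating to it. A minimal sketch (counting allocations in the object domain; the counter and function names are mine, not a CPython API, and the docs' thread-safety and install-early caveats still apply):

```
#include <Python.h>

static PyMemAllocatorEx orig_alloc;   /* the allocator we wrap */
static size_t nmalloc = 0;            /* how many mallocs we observed */

static void *
counting_malloc(void *ctx, size_t size)
{
    nmalloc++;
    return orig_alloc.malloc(orig_alloc.ctx, size);  /* always delegate */
}

/* Install the hook for PYMEM_DOMAIN_OBJ; every other entry point keeps
 * pointing at the original allocator's functions. */
static void
install_hook(void)
{
    PyMemAllocatorEx hook;
    PyMem_GetAllocator(PYMEM_DOMAIN_OBJ, &orig_alloc);
    hook = orig_alloc;            /* copy ctx, calloc, realloc, free */
    hook.malloc = counting_malloc;
    PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &hook);
}
```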

[Inada Naoki <songofacandy@gmail.com>]
Which is 3 lines of code plus a closing brace. The OP didn't build their own Python, and the source from which it was compiled wasn't available to them. But they did say that when they got into gdb and single-stepped, it was looping through the same three lines of code, in obmalloc.c, over & over. Which is 100% consistent with the loop you identified as being "the hot spot".

On Fri, May 24, 2019 at 3:49 AM Gregory P. Smith <greg@krypto.org> wrote:
I suggest filing a bug to track this...
I created the issue: https://bugs.python.org/issue37029 -- Inada Naoki <songofacandy@gmail.com>

It seems pretty clear now that the primary cause is keeping arenas sorted by number of free pools. As deallocation goes on, the number of distinct "# of free pools" values decreases, leaving large numbers of arenas sharing the same value. Then, e.g., if there are 10,000 arenas with 30 free pools remaining, and another pool is freed in one of them, its free pool count jumps to 31, and on average we have to crawl over 5,000 of its former peers to locate its new position in the list. Which is also consistent with what the OP saw, and I did too but in a much smaller case: the longer deallocation goes on, the slower it gets (fewer and fewer Nodes freed per unit time). Which suggests a different - albeit related - approach: it's not really the order of arenas that matters, but the partitioning of arenas by number of free pools. So instead of a single doubly linked list of arenas, have a vector of doubly linked lists, one list per possible number of free pools. That's a fixed and relatively small number. For example, even if all arena bytes could be used for pools (they can't be - there's bookkeeping info at the start of an arena), the maximum number of free pools in an arena would be 2**18 / 2**12 == 2**6 == 64. When a pool is freed, no big deal: unlink the arena from its current list, and stick it in the list for the next higher number of free pools. This, of course, is constant-time. Of course there are edge cases (e.g., if the arena is entirely free, it should be given back to C instead), but they seem pretty obvious, and I haven't thought of a real problem. Tedious, though ;-)

On 5/22/19 12:15 PM, Tim Peters wrote:
I have a computer with two Xeon CPUs and 256GB of RAM. So, even though it's NUMA, I still have 128GB of memory per CPU. It's running a "spin" of Ubuntu 18.10. I compiled a fresh Python 3.7.3 --with-optimizations. I copied the sample program straight off the StackOverflow page. The program ran for about five and a half hours then exited normally. During the run it printed:

    This gets printed!
    This doesn't get printed

Statistics reported by "time":

    19811.05s user 123.56s system 99% cpu 5:32:15.04 total

Checking in on it now and then, peak observed memory usage (as reported by "top") was just under 80GB. I take it that the interesting part was confirming that "This doesn't get printed" gets printed when you have enough RAM for the program to run to completion. So I guess there's no bug here? Just an observation about CPython's garbage collector being kinda slow? Or maybe CPython gc + swap = super bombad slow? //arry/

[Larry Hastings <larry@hastings.org>]
Thanks, Larry!
All of that is consistent with what others reported, although you've given more quantified details, and that's helpful.
I take it that the interesting part was confirming that "This doesn't get printed" gets printed when you have enough RAM for the program to run to completion.
That was the primary point at first, yes.
No need to confirm this: if you ran it again with a bit more output, you'd find that the vast bulk of the time was spent between the two output lines. That is, it takes hours between the time the train() function starts to return and the time it finishes returning. That's all consumed as a side effect of `tree` losing its last reference. We know why now, and it's hard to say whether it's "a bug". It's functioning as designed, but the design sucks ;-) So I'd call it a design flaw. The heuristics that were introduced to try to make it likely we could return more empty arenas to C were designed when machines had far less RAM, and can burn time quadratic in the number of arenas. That really didn't matter in a world with a few hundred arenas, but can be a timing disaster in a world with hundreds of thousands (or more) of arenas. The Stackoverflow test case was whittled way down from the OP's real program, which ran "for days" without completing. I've sketched two (albeit closely related, both to each other and to what we currently do) ways the worst-case overall time could be cut from quadratic to linear in the number of arenas, which is asymptotically optimal. More ambitious would be to dream up an entirely different heuristic. Keeping the list of usable arenas fully sorted in order of number of free pools seems a very strong requirement for a poke-and-hope heuristic. But, best I can tell, the person who came up with this heuristic to begin with didn't check in any motivating test programs. So we'd be starting over from scratch. For this specific Stackoverflow case, I've confirmed that it doesn't make a real difference if we didn't bother to sort the list of arenas _at all_ (then it still returns about the same number of arenas to C, both at the point the tree finishes building, and at the time it completes tearing down the tree). So I have a program showing the current scheme can be a disaster, but none where it shows it actually helps anything ;-)

I made a pull request for this that appears to work fine for my 10x smaller test case (reduces tear-down time from over 45 seconds to over 7). It implements my second earlier sketch (add a vector of search fingers, to eliminate searches): https://github.com/python/cpython/pull/13612 It would be good to know how it works on the OP's 10x larger test case, which requires a machine with over 80 GB of RAM. Hoping it will reduce tear-down time from hours to minutes (they have about a billion `Node` objects, so tear-down time will never be trivial). It would also be good to get a code review, if you have the time and inclination. Thanks!

The PR for this looks good to go: https://github.com/python/cpython/pull/13612 But, I still have no idea how it works for the OP's original test case. So, if you have at least 80 GB of RAM to try it, I added `arena.py` to the BPO report: https://bugs.python.org/issue37029 That adds code to the OP's test case to display the times needed to build the tree and to tear it down (& to display some obmalloc stats). So there's no need for you to think about anything ;-) I'm keen to get feedback on this before merging the PR, because this case is so very much larger than anything I've ever tried that I'm wary that there may be more than one "surprise" lurking here. The PR certainly addresses "an obvious" (with hindsight) problem - but is that the _only_ gross design flaw here?

[Tim]
[Inada Naoki <songofacandy@gmail.com>]
I started an r5a.4xlarge EC2 instance and started arena.py. I will post the result within the next 12 hours.
Thank you! Wrapping this up, Inada attached the results to the bug report. For the OP's original big case, the time to reclaim the tree storage dropped from nearly 5 hours to about 70 seconds. The time to build the tree didn't materially change (it was a bit faster after the patch, but offhand didn't look significantly faster to me). Since I called this a "design flaw" rather than "a bug", I'm not inclined to backport it. If someone else thinks it should be, that's up to them. I'm _assuming_ Github won't decide to backport it on its own - but maybe I'm wrong (Github workflow is still pretty magical to me).


On Fri, May 24, 2019 at 9:41 AM Inada Naoki <songofacandy@gmail.com> wrote:
688 => 84 looks like an interesting speedup. Would it be technically possible to make ARENA_SIZE configurable at runtime? Using the new PEP 587 preinitialization, it shouldn't be too hard to add a new command line option and/or an environment variable to tune the memory allocator: https://www.python.org/dev/peps/pep-0587/#preinitialization-with-pypreconfig Victor -- Night gathers, and now my watch begins. It shall not end until my death.
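For context, the PEP 587 pre-initialization API already lets an embedding application pick the allocator family before the runtime starts; a hypothetical ARENA_SIZE knob would presumably hang off the same structure. A sketch using the API roughly as it landed in 3.8 (the arena-size field in the comment does not exist; it is only the idea being discussed):

```
#include <Python.h>

int
main(int argc, char **argv)
{
    PyPreConfig preconfig;
    PyPreConfig_InitPythonConfig(&preconfig);

    /* Existing knob: pick the raw allocator family up front. */
    preconfig.allocator = PYMEM_ALLOCATOR_MALLOC;   /* or PYMEM_ALLOCATOR_PYMALLOC */

    /* Hypothetical knob discussed here, NOT a real field:
     * preconfig.obmalloc_arena_size = 1024 * 1024; */

    PyStatus status = Py_PreInitialize(&preconfig);
    if (PyStatus_Exception(status)) {
        Py_ExitStatusException(status);
    }

    Py_Initialize();
    PyRun_SimpleString("import sys; print(sys.version)");
    Py_Finalize();
    return 0;
}
```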

[Inada Naoki <songofacandy@gmail.com>]
For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:
I'm unclear on what "nodes" means. If you mean you changed 27M to 10M in this line: for token in random_strings(27_000_000): that's fine, but there are about 40 times more than that `Node` objects created by the program. So if you did change that to 10_000_000, you'd have about 400M `Node` objects, consuming about 80x that many bytes of RAM (or about 32GB).
Nice! I assume these are seconds. On Stackoverflow, the OP reported that boosting ARENA_SIZE the same way cut deallocation time in his original program to about 13 minutes.
While the OP reported, for the original program: """ With PYTHONMALLOC=malloc, insertion takes nearly twice as long, but deallocation only takes 2 minutes! """ The "nearly twice as long" for building the tree is in sharp contrast to what you saw, but there's about 3x as much of everything in the original program, and, e.g., it's certainly possible malloc is tripping over fragmentation then that obmalloc avoids.
I can't say. I'll note that the approach I very briefly sketched before (restructure the list of arenas to partition it into multiple lists partitioned by number of free pools) "should make" obmalloc competitive with malloc here (it would eliminate all searches, except perhaps (depending on how gonzo the code is) a rare need to search for "the first" non-empty list).

[Tim]
But it's also intrusive, breaking up a simple linked list into a multi-headed beast. That affects all code using it, not just the parts mutating it. So instead I suggest leaving `usable_arenas` entirely alone, but add a vector of "search fingers", say,

    static struct arenaobject* nfp2lasta[MAX_NUMBER_OF_FREE_POOLS_POSSIBLE + 1];

mapping a count of free pools to (a pointer to) the last arena object in usable_arenas with that number of free pools. Then the search loop in pymalloc_free() can be replaced with a single array lookup: it leaps directly to where the search loop ends now. For example, if we free a pool in an arena that had 23 free pools, then the arena object's count of free pools has to be increased to 24, and the arena object unlinked from its current position in usable_arenas and inserted immediately after nfp2lasta[23]. Note that usable_arenas is a doubly linked list, so there's no _need_ at all to iterate over it. Each node in the list knows what's immediately before and after it. And since a free pool count can only increase by 1, it's always correct to move the arena immediately after the last arena with the same original free pool count.

Key invariants:

1. nfp2lasta[n] is NULL if and only if no arena in usable_arenas has nfreepools == n.
2. nfp2lasta[pa->nfreepools] == pa if and only if pa is the only arena in usable_arenas with that many free pools.

So, in the example above, nfp2lasta[23] has to be set to NULL if and only if the arena in question was the only one with 23 free pools (which can be determined cheaply via invariant #2). And nfp2lasta[24] has to be set to point to the arena in question if and only if nfp2lasta[24] is NULL. Tricky bit: if the arena in question is the only one with a given original free pool count, no pointers in arena objects need to be changed at all. Following the sketch above literally would leave you trying to insert it after itself, which wouldn't end well ;-) Anyway, this "feels like a right solution" to me, eliminating all searches with simple (albeit delicate) constant-time code, and requiring code changes only where an arena's number of free pools can change.
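To make the constant-time claim concrete, here is a rough C sketch of that bookkeeping. It assumes usable_arenas is kept sorted by nfreepools in increasing order, borrows the field names from obmalloc.c, and uses illustrative stand-ins for MAX_NFREEPOOLS and the list helpers (head-pointer maintenance omitted); it also folds in the correction posted a couple of messages below. It is not the code that was merged in PR 13612.

```
#include <stddef.h>

struct arena_object {
    unsigned int nfreepools;
    struct arena_object *nextarena, *prevarena;
    /* ... the rest as in obmalloc.c ... */
};

#define MAX_NFREEPOOLS 64   /* arena size / pool size */

static struct arena_object *nfp2lasta[MAX_NFREEPOOLS + 1];

static void
unlink_arena(struct arena_object *ao)
{
    if (ao->prevarena != NULL)
        ao->prevarena->nextarena = ao->nextarena;
    if (ao->nextarena != NULL)
        ao->nextarena->prevarena = ao->prevarena;
}

static void
insert_after(struct arena_object *pos, struct arena_object *ao)
{
    ao->prevarena = pos;
    ao->nextarena = pos->nextarena;
    if (pos->nextarena != NULL)
        pos->nextarena->prevarena = ao;
    pos->nextarena = ao;
}

/* Called when one more pool in `ao` becomes free. */
static void
note_pool_freed(struct arena_object *ao)
{
    unsigned int nf = ao->nfreepools;          /* count before this free */
    struct arena_object *lastnf = nfp2lasta[nf];

    /* Finger for the old count: if it pointed at ao, repoint it at the
     * previous arena (if that one also has nf free pools) or clear it. */
    if (lastnf == ao) {
        struct arena_object *p = ao->prevarena;
        nfp2lasta[nf] = (p != NULL && p->nfreepools == nf) ? p : NULL;
    }

    ao->nfreepools = nf + 1;

    /* ao belongs immediately after the last arena that still has nf
     * free pools; if ao itself was that arena, nothing moves at all. */
    if (lastnf != NULL && lastnf != ao) {
        unlink_arena(ao);
        insert_after(lastnf, ao);
    }

    /* Finger for the new count: only claim it if no arena with nf+1
     * free pools already sits later in the list. */
    if (nfp2lasta[nf + 1] == NULL)
        nfp2lasta[nf + 1] = ao;
}
```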

[Tim]
Ack! Scratch that. I need a nap :-( In fact if that equality holds, it means that nfp2lasta entry has to change if pa is moved and pa->prevarena has the same count of free pools. So forget about the explanation and just think about the goal - the details will work themselves out ;-)

On Thu, May 23, 2019 at 5:15 PM Steve Dower <steve.dower@python.org> wrote:
FYI, and I've mentioned this at PyCon to a few people (might've been you, Steve, I don't remember) -- but at Google we've experimented with disabling obmalloc when using TCMalloc (a faster and thread-aware malloc, which makes a huge difference within Google when dealing with multi-threaded C++ libraries), both using the usual Python benchmarks and real-world code with real-world workloads (a core part of YouTube, for example), all on Linux. There's still a significant benefit to using obmalloc when using glibc's malloc, and also a noticeable benefit when using TCMalloc. There are certainly cases where it doesn't matter much, and there may even be cases where the overhead of obmalloc isn't worth it, but I do believe it's still a firm net benefit.
-- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!

On Fri, 24 May 2019 14:23:21 +0200 Thomas Wouters <thomas@python.org> wrote:
Interesting that a 20-year-old simple allocator (obmalloc) is able to do better than the sophisticated TCMalloc. (well, of course, obmalloc doesn't have to worry about concurrent scenarios, which explains some of the simplicity) Regards Antoine.

[Antoine Pitrou, replying to Thomas Wouters]
Interesting that a 20-year-old simple allocator (obmalloc) is able to do better than the sophisticated TCMalloc.
It's very hard to beat obmalloc (O) at what it does. TCMalloc (T) is actually very similar where they overlap, but has to be more complex because it's trying to do more than O. In either case, for small objects "the fast path" consists merely of taking the first block of memory off a singly-linked size-segregated free list. For freeing, the fast path is just linking the block back to the front of the appropriate free list. What _could_ be faster? A "bump allocator" allocates faster (just increment a highwater mark), but creates worlds of problems when freeing. But because O is only trying to deal with small (<= 512 bytes) requests, it can use a very fast method based on trivial address arithmetic to find the size of an allocated block by just reading it up from the start of the (4K) "pool" the address belongs to. T can't do that - it appears to need to look up the address in a more elaborate radix tree, to find info recording the size of the block (which may be just about anything - no upper limit).
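The "trivial address arithmetic" deserves a picture: because each pool is a 4 KiB-aligned 4 KiB region, masking the low bits of any small-object address lands on the pool header, which records the size class. A simplified sketch (field and macro names follow obmalloc.c, but this is a paraphrase with placeholder fields, not the exact source, and it assumes the 3.7-era 8-byte size class granularity):

```
#include <stdint.h>
#include <stddef.h>

#define ALIGNMENT        8                     /* size class granularity */
#define POOL_SIZE        (4 * 1024)            /* one system page per pool */
#define POOL_SIZE_MASK   (POOL_SIZE - 1)

/* Every pool starts with a header; szidx is the pool's size class. */
typedef struct pool_header {
    /* ... other bookkeeping fields ... */
    unsigned int szidx;
} *poolp;

/* Round a block address down to the 4 KiB boundary of its pool. */
#define POOL_ADDR(p) ((poolp)((uintptr_t)(p) & ~(uintptr_t)POOL_SIZE_MASK))

/* Size of the block `p` points to, recovered with no lookup structure. */
static size_t
small_block_size(void *p)
{
    poolp pool = POOL_ADDR(p);
    return (size_t)(pool->szidx + 1) * ALIGNMENT;
}
```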
(well, of course, obmalloc doesn't have to worry about concurrent scenarios, which explains some of the simplicity)
Right, T has a different collection of free lists for each thread, so on each entry it has to figure out which collection to use (and so doesn't need to lock). That's not free. O only has one collection, and relies on the GIL. Against that, O burns cycles worrying about something else: because it was controversial when it was new, O thought it was necessary to handle free/realloc calls even when passed addresses that had actually been obtained from the system malloc/realloc. The T docs I saw said "don't do that - things will blow up in mysterious ways". That's where O's excruciating "address_in_range()" logic comes from. While that's zippy and scales extremely well (it doesn't depend on how many objects/arenas/pools exist), it's not free, and is a significant part of the "fast path" expense for both allocation and deallocation. It also limits us to a maximum pool size of 4K (to avoid possible segfaults when reading up memory that was actually obtained from the system malloc/realloc), and that's become increasingly painful: on 64-bit boxes the bytes lost to pool headers increased, and O changed to handle requests up to 512 bytes instead of its original limit of 256. O was intended to supply "a bunch" of usable blocks per pool, not just a handful. We "should" really at least double the pool and arena sizes now. I don't think we need to cater anymore to careless code that mixes system memory calls with O calls (e.g., if an extension gets memory via `malloc()`, it's its responsibility to call `free()`), and if not then `address_in_range()` isn't really necessary anymore either, and then we could increase the pool size. O would, however, need a new way to recognize when its version of malloc punted to the system malloc. BTW, one more: last I saw T never returns memory to "the system", but O does - indeed, the parent thread here was all about _enormous_ time waste due to that in O ;-) That's not free either, but doesn't affect O's fast paths.
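For readers who haven't stared at obmalloc: address_in_range() stores, in each pool header, the index of the arena the pool was carved from, then checks whether the passed-in address actually lies inside that arena. Paraphrased and simplified from Objects/obmalloc.c (not a drop-in copy; the real code also uses a volatile read of the possibly-uninitialized header field):

```
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define ARENA_SIZE   (256 * 1024)        /* 256 KiB arenas */

struct arena_object {
    uintptr_t address;                   /* 0 if the arena was released */
    /* ... */
};

static struct arena_object *arenas = NULL;   /* vector of arena descriptors */
static unsigned int maxarenas = 0;

typedef struct pool_header {
    /* ... */
    unsigned int arenaindex;             /* which arena this pool lives in */
} *poolp;

/* Did this address come from an obmalloc pool (as opposed to the system
 * malloc)?  The header read may touch memory obmalloc never wrote, which
 * is why pools currently can't be bigger than a system page. */
static bool
address_in_range(void *p, poolp pool)
{
    unsigned int idx = pool->arenaindex;
    return idx < maxarenas &&
           (uintptr_t)p - arenas[idx].address < ARENA_SIZE &&
           arenas[idx].address != 0;
}
```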

On Sun, 2 Jun 2019 00:56:52 -0500 Tim Peters <tim.peters@gmail.com> wrote:
The interesting thing here is that in many situations, the size is known up front when deallocating - it is simply not communicated to the deallocator because the traditional free() API takes a sole pointer, not a size. But CPython could communicate that size easily if we would like to change the deallocation API. Then there's no bother looking up the allocated size in sophisticated lookup structures. I'll note that jemalloc provides such APIs: http://jemalloc.net/jemalloc.3.html """The dallocx() function causes the memory referenced by ptr to be made available for future allocations. The sdallocx() function is an extension of dallocx() with a size parameter to allow the caller to pass in the allocation size as an optimization.""" Regards Antoine.
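For concreteness, here is what the sized-deallocation API looks like on the jemalloc side, per its man page (the thin wrapper names are mine, and flags == 0 just means default behavior):

```
#include <stdlib.h>
#include <jemalloc/jemalloc.h>   /* provides mallocx() / sdallocx() */

/* Hypothetical wrappers: keep the allocation size around so the
 * deallocator never has to look it up in jemalloc's metadata. */
static void *
sized_alloc(size_t size)
{
    return mallocx(size, 0);
}

static void
sized_free(void *p, size_t size)
{
    sdallocx(p, size, 0);        /* caller passes the known size back */
}
```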

[Antoine Pitrou <solipsis@pitrou.net>]
That could work (to make it possible to increase obmalloc's pool size). Except ...
obmalloc doesn't intend to be a general-purpose allocator - it only aims at optimizing "small" allocations, punting to the system for everything beyond that. Unless the size is _always_ passed in (on every free() and realloc() spelling it supports), an "optimization" doesn't help it much. It needs a bulletproof way to determine whether it, or system malloc/realloc, originally obtained an address passed in. If the size is always passed in, no problem (indeed, a single bit could suffice). But if it's ever possible that the size won't be passed in, all the runtime machinery to figure that out on its own needs to be able to handle all addresses. Like now: if the size were passed in, obmalloc could test the size instead of doing the `address_in_range()` dance(*). But if it's ever possible that the size won't be passed in, all the machinery supporting `address_in_range()` still needs to be there, and every obmalloc spelling of malloc/realloc needs to ensure that machinery will work if the returned address is passed back to an obmalloc free/realloc spelling without the size. The "only" problem with address_in_range is that it limits us to a maximum pool size of 4K. Just for fun, I boosted that to 8K to see how likely segfaults really are, and a Python built that way couldn't even get to its first prompt before dying with an access violation (Windows-speak for segfault). Alas, that makes it hard to guess how much value there would be for Python programs if the pool size could be increased - can't even get Python started. We could eliminate the pool size restriction in many ways. For example, we could store the addresses obtained from the system malloc/realloc - but not yet freed - in a set, perhaps implemented as a radix tree to cut the memory burden. But digging through 3 or 4 levels of a radix tree to determine membership is probably significantly slower than address_in_range. I can think of a way to do it slightly faster than (but related to) address_in_range, but it would (on a 64-bit box) require adding 24 more bytes for each system-malloc/realloc allocation. 8 of those bytes would be pure waste, due to the fact that the Alignment Gods appear to require 16-byte alignment for every allocation on a 64-bit box now. In stark contrast, the extra memory burden of the current address_in_range is an insignificant 8 bytes per _arena_ (256 KB, and "should be" larger now). Another approach: keep address_in_range as-is, but add new internal structure to larger pools, to repeat the arena index every 4KB. But that fights somewhat against the goal of larger pools. Etc. ;-) (*) Actually not quite true. If a "large" object is obtained from obmalloc now (meaning it actually came from the system malloc), then cut back to a "small" size by a realloc, it _remains_ under the control of the system malloc now. Passing in the newer "small" size to a free() later would cause obmalloc to get fatally confused about that.

On Thu, 6 Jun 2019 13:57:37 -0500 Tim Peters <tim.peters@gmail.com> wrote:
But my response was under the assumption that we would want obmalloc to deal with all allocations. Which is more or less required anyway to have an efficient GC that doesn't have to walk linked lists and access memory in random order to iterate over known objects. Regards Antoine.

[Antoine Pitrou <solipsis@pitrou.net>]
But my response was under the assumption that we would want obmalloc to deal with all allocations.
I didn't know that. I personally have no interest in that: if we want an all-purpose allocator, there are several already to choose from. There's no reason to imagine we could write a better one.
As the parent thread showed, obmalloc does at least as well as any general-purpose allocator known _for Python's purposes_ (a great many (de)allocations of "small" objects). Already explained too that size-segregated linked free lists are the core of _why_ it does so well. Besides making carving off, and freeing, blocks dirt cheap, linked lists also naturally support recycling memory blocks in MRU ("freshness in cache") order. But I don't know what you mean by "access memory in random order to iterate over known objects". obmalloc never needs to iterate over known objects - indeed, it contains no code capable of doing that. Our cyclic gc does, but that's independent of obmalloc. Over on Discourse, Neil is speculating about using radix trees for cyclic gc instead of _its_ linked lists. In obmalloc, allocated memory regions aren't linked at all. It's free regions that are linked, and helpfully so in MRU order.

On Thu, 6 Jun 2019 16:03:03 -0500 Tim Peters <tim.peters@gmail.com> wrote:
It's not. Cyclic GC needs its own linked lists *because* the allocator doesn't allow it to iterate over allocated objects. Regards Antoine.

[Tim]
[Antoine]
It's not. Cyclic GC needs its own linked lists *because* the allocator doesn't allow it to iterate over allocated objects.
The doubly linked lists in gc primarily support efficient _partitioning_ of objects for gc's purposes (a union of disjoint sets, with constant-time moving of an object from one set to another, and constant-time union of disjoint sets). "All objects" is almost never interesting to it (it is only when the oldest non-frozen generation is being collected). Between collections, the partitioning is by generation. During a collection, the youngest generations are combined into one, and then that's sub-partitioned in various ways as collection goes along, ending with a partition into reachable and unreachable objects. In between, ephemeral partitions are constructed (e.g., the set of "tentatively unreachable" objects). None of that was driven by obmalloc's (or malloc's) inability to iterate over objects. Doubly linked lists were the obvious way to implement the required operations on partitions efficiently and simply. In any case, it appears to be "a feature" now that people can use any flavor of the malloc family they like in CPython, so I expect that any attempt to tie cyclic gc to a specific flavor of malloc would be a very hard sell. Which, BTW, was the intended meaning of "independent": cyclic gc right now couldn't care less which version of malloc a user plugs in - nor could obmalloc care less which cyclic gc algorithm is used.
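The partitioning operations named here - constant-time move of one node between sets, constant-time union of two sets - are generic properties of intrusive circular doubly linked lists. A generic sketch of just those two operations (this is the general technique, not gcmodule's actual PyGC_Head code):

```
/* A set is a circular doubly linked list with a dummy head node. */
struct node {
    struct node *prev, *next;
};

static void
list_init(struct node *head)                 /* empty set */
{
    head->prev = head->next = head;
}

static void
list_unlink(struct node *n)                  /* O(1) */
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void
list_move(struct node *n, struct node *dst)  /* O(1): move n into set dst */
{
    list_unlink(n);
    n->prev = dst->prev;
    n->next = dst;
    dst->prev->next = n;
    dst->prev = n;
}

static void
list_merge(struct node *from, struct node *into)  /* O(1): union; empties `from` */
{
    if (from->next != from) {                /* non-empty source */
        struct node *f_first = from->next, *f_last = from->prev;
        struct node *i_last = into->prev;
        i_last->next = f_first;
        f_first->prev = i_last;
        f_last->next = into;
        into->prev = f_last;
        list_init(from);
    }
}
```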

On Thu, 6 Jun 2019 17:26:17 -0500 Tim Peters <tim.peters@gmail.com> wrote:
Right. But the oldest generation is precisely the pain point, since full collections can induce very long pauses. IOW, perhaps it would be fine to keep dedicated lists for the young generations, while doing a heap walk for the full collection case.
Depends on the benefits of course ;-) Most people use some pre-built binaries that "naturally" come with obmalloc enabled. Of course, it's only a very small minority of applications that hit real performance issues with the GC - and in probably many of those cases tuning the full collection threshold can alleviate the problem still. Regards Antoine.

On 2019-06-06, Tim Peters wrote:
My current idea is to put partitioning flags on the interior radix tree nodes. If you mark an object as "finalizer reachable", for example, it would mark all the nodes on the path from the root with that flag. Then, when you want to iterate over all the GC objects with a flag, you can avoid uninteresting branches of the tree. For generations, maybe tracking them at the pool level is good enough. Interior nodes can track generations too (i.e. the youngest generation contained under them). My gut feeling is that the prev/next pointer updates done by move_unreachable() and similar functions must be quite expensive. Doing the traversal with an explicit stack is a lot less elegant but I think it should be faster. At least, when you are dealing with a big set of GC objects that don't fit in the CPU cache. Regards, Neil

On 2019-06-06, Tim Peters wrote:
We can almost make it work for GC objects; the use of obmalloc is quite well encapsulated. I think I intentionally designed the PyObject_GC_New/PyObject_GC_Del/etc APIs that way. Quick and dirty experiment is here: https://github.com/nascheme/cpython/tree/gc_malloc_free_size The major hitch seems to be my new gc_obj_size() function. We can't be sure the 'nbytes' passed to _PyObject_GC_Malloc() is the same as what is computed by gc_obj_size(). It usually works but there are exceptions (freelists for frame objects and tuple objects, for one). A nasty problem is the weirdness with PyType_GenericAlloc() and the sentinel item. _PyObject_GC_NewVar() doesn't include space for the sentinel but PyType_GenericAlloc() does. When you get to gc_obj_size(), you don't know if you should use "nitems" or "nitems+1". I'm not sure how to fix the sentinel issue. Maybe a new type slot or a type flag? In any case, making a change like my git branch above would almost certainly break extensions that don't play nicely. It won't be hard to make it a build option, like the original gcmodule was. Then, assuming there is a performance boost, people can enable it if their extensions are friendly.
If we can make the above idea work, you could set the pool size to 8K without issue. A possible problem is that the obmalloc and gcmalloc arenas are separate. I suppose that affects performance testing.
You are likely correct. I'm hoping to benchmark the radix tree idea. I'm not too far from having it working such that it can replace address_in_range(). Maybe allocating gc_refs as a block would offset the radix tree cost vs address_in_range(). If the above idea works, we know the object size at free() and realloc(), we don't need address_in_range() for those code paths. Regards, Neil

To be clearer, while knowing the size of allocated objects may be of some use to some other allocators, "not really" for obmalloc. That one does small objects by itself in a uniform way, and punts everything else to the system malloc family. The _only_ thing it wants to know on a free/realloc is which of those two ways delivered the address to begin with. Knowing the original size would be an indirect way to accomplish that, but it really only needs 1 bit of info. If you're using a radix tree to implement - in effect - a dict mapping addresses to "stuff", 1 flag bit in the "stuff" would be ideal for that. If you haven't already, I assume you'll soon come around to believe you really should be tracking the addresses of all gc'ed objects (not just the "small" ones).
If the above idea works, we know the object size at free() and realloc(), we don't need address_in_range() for those code paths.
Two things: first, a single bit would not only be sufficient, it would be _better_ than knowing the object size. If the bit says the object came from the system, the size is of no use at all. If the bit says it came from obmalloc, what's _really_ wanted is the index of the object's "size class" in the vector of free lists, and that's already directly available in the object's pool header (various parts of which have to be read up &/or updated regardless). Second, if that bit were available, `address_in_range()` could be thrown away - obmalloc's free and realloc entries are the only places it's used (or is of any conceivable use to obmalloc). For the current obmalloc, I have in mind a different way (briefly: let a pool span a number of 4K pages, but teach pools about the page structure so that address_in_range() continues to work after trivial changes (read the arena index from the containing page's start rather than from the containing pool's)). Not ideal, but looks remarkably easy to do, and captures the important part (more objects in a pool -> more times obmalloc can remain in its fastest "all within the pool" paths).
At the start, it was the _reachable_ objects that were moved out of the collected generation list rather than the unreachable objects. Since it's (in all my experience) almost all objects that survive almost all collections in almost all programs, that was an enormous amount of moving overall. So, long ago, I rewrote that code to move the _un_reachable objects instead. Debug output showed countless billions of pointer updates saved across the programs I tried at the time, but the timing difference was in the noise. Probably a big part of "the problem": when collecting the oldest generation in a program with "a lot" of objects, no approach at all will "fit in cache". We have to crawl the entire object graph then. Note that the test case in the parent thread had over a billion class instances, and it was way whittled down(!) from the OP's real program. But I don't _think_ that's typical quite yet ;-) , and:
I certainly agree gc_refs is the big-bang-for-the-buck thing to target. There's no way to avoid mutating that bunches and bunches of times, even in small programs, and reducing the number of dirtied cache lines due to that "should" pay off.

[Tim]
And now there's a PR that removes obmalloc's limit on pool sizes, and, for a start, quadruples pool (and arena!) sizes on 64-bit boxes: https://github.com/python/cpython/pull/13934 https://bugs.python.org/issue37211 As the PR says, """ It would be great to get feedback from 64-bit apps that do massive amounts of small-object allocations and deallocations. """ :-)

On 2019-06-09, Tim Peters wrote:
And now there's a PR that removes obmalloc's limit on pool sizes, and, for a start, quadruples pool (and arena!) sizes on 64-bit boxes:
Neat.
I've done a little testing of the pool overhead. I have an application that uses many small dicts as holders of data. The function sys._debugmallocstats() is useful to get stats for the obmalloc pools. Total data allocated by obmalloc is 262 MB. At the 4*PAGE_SIZE pool size, the wasted space due to partly filled pools is only 0.18%. For 16*PAGE_SIZE pools, 0.71%. I have a set of stats for another program. In that case, total memory allocated by obmalloc is 14 MB. For 4*PAGE_SIZE pools, wasted space is 0.78% of total. At 16*PAGE_SIZE, it is 2.4%. Based on that small set of data, using 4*PAGE_SIZE seems conservative. As I'm sure you realize, making pools bigger will waste actual memory, not just virtual address space, because you write the arena pointer to each OS page. I want to do performance profiling using Linux perf. That should show where the hotspot instructions are in the obmalloc code. Maybe that will be useful to you. Another thought about address_in_range(): some operating systems allow you to allocate memory at specific alignments. Or, you can even allocate a chunk of memory at a fixed memory location if you do the correct magic incantation. I noticed that Go does that. I imagine doing that has a bunch of associated challenges with it. However, if we could control the alignment and memory location of obmalloc arenas, we would not have the segv problem of address_in_range(). It's probably not worth going down that path due to the problems involved. Regards, Neil

[Neil Schemenauer <nas-python@arctrix.com>]
Definitely a tradeoff here: increasing the number of pages per pool is a pure win for objects of very popular size classes, but a pure loss for objects of unpopular size classes. New comments about POOL_SIZE in the patch briefly hint at that.
Based on that small set of data, using 4*PAGE_SIZE seems conservative.
Which is a nice way of saying "pulled out of Tim's ass" ;-)
Yes, but mostly no! It remains the case that obmalloc neither writes nor reads anything in an arena until a malloc/realloc call actually needs to return a pointer to a never-before accessed page. Writing the arena index is NOT done to pages when the arena is allocated, or even when a new pool is carved off an arena, but lazily, one page at a time, lowest address to highest, as fresh pages actually need to be returned to the caller. So arena pointers aren't actually written more frequently than when using pools of just one page, as is done now (which _still_ writes the arena index into each page). Subtlety: currently, if you pass a nonsense address to address_in_range that happens to be in one of the arena's pools, address_in_range() says "yes". However, after the patch, it may well return "no" if the address is in one of the pool's pages that hasn't yet been returned to a malloc/realloc caller (in which case the arena index hasn't yet been stored at the start of the page). I don't care about that, because it's 100% a logic error to pass an address to free/realloc that wasn't obtained from a previous malloc/realloc call. So it's a change in undefined behavior.
Would be good to know, yes! But may also depend on the app.
I'm unclear on how that could be exploited. It's not addresses that come from arenas that create segv headaches, it's addresses that come from the system's malloc/realloc. If we were, e.g., to pick a maximum amount of address space obmalloc can use in advance, we could live with a single arena of that size, and just check whether an address is within it. All the problems come from the fact that address space may alternate, any number of times, between addresses in obmalloc arenas and addresses from the system malloc's internals.

On Sun, Jun 2, 2019 at 7:57 AM Tim Peters <tim.peters@gmail.com> wrote:
Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime? And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range? Getting rid of address_in_range sounds like a nice idea, and I would love to test how feasible it is -- I can run such a change against a wide selection of code at work, including a lot of third-party extension modules, but I don't see an easy way to do it right now. -- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!

[Tim]
[Thomas Wouters <thomas@python.org>]
Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime?
I think so. See the "Memory Management" section of the Python/C API Reference Manual. It's always been "forbidden" to, e.g., allocate a thing with PyMem_New() but release it with free(). Ditto mixing a PyMem_Raw... allocator with a PyMem... deallocator, or PyObject... one. Etc. A type's tp_dealloc implementation should damn well know which memory family the type's allocator used. However, no actual proposal on the table changes any "fact on the ground" here. They're all as forgiving of slop as the status quo.
And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range?
_The_ most efficient way is the one almost all allocators used long ago: use some "hidden" bits right before the address returned to the user to store info about the block being returned. Like 1 bit to distinguish between "obmalloc took this out of one of its pools" and "obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc doesn't care)". That would be much faster than what we do now. But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain alignment, so space overhead becomes 100% for the smallest objects obmalloc can return :-( Neil Schemenauer takes a different approach in the recent "radix tree arena map for obmalloc" thread here. We exchanged ideas on that until it got to the point that the tree levels only need to trace out prefixes of obmalloc arena addresses. That is, the new space burden of the radix tree appears quite reasonably small. It doesn't appear to be possible to make it faster than the current address_in_range(), but in small-scale testing so far speed appears comparable.
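As a toy illustration of that "hidden bits" approach, here is a small, self-contained C sketch - tag_malloc(), tag_free(), and the 16-byte header are invented names for illustration, not proposed obmalloc entry points:

```c
/* Toy sketch: over-allocate by 16 bytes (to preserve alignment), record
   which underlying allocator supplied the block, and dispatch free() on
   that tag with no address_in_range()-style lookup at all. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define HEADER 16              /* 1 bit of info costs 16 bytes for alignment */
#define FROM_POOL 0
#define FROM_SYSTEM 1

static void *tag_malloc(size_t nbytes, int origin)
{
    /* a real allocator would pick the origin itself; here the caller does */
    unsigned char *base = malloc(nbytes + HEADER);
    if (base == NULL)
        return NULL;
    base[0] = (unsigned char)origin;   /* the "hidden bit" */
    return base + HEADER;
}

static void tag_free(void *p)
{
    if (p == NULL)
        return;
    unsigned char *base = (unsigned char *)p - HEADER;
    if (base[0] == FROM_SYSTEM) {
        free(base);            /* punt back to the system allocator */
    }
    else {
        /* would push the block back on its pool's free list; in this
           toy both paths happen to end in free() */
        free(base);
    }
}

int main(void)
{
    char *s = tag_malloc(100, FROM_SYSTEM);
    strcpy(s, "hello");
    printf("%s\n", s);
    tag_free(s);
    return 0;
}
```

The free path dispatches purely on the tag - that's the speed appeal - but the 16 bytes of header are exactly the 100% overhead for the smallest size class mentioned above.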
Neil's branch is here: https://github.com/nascheme/cpython/tree/obmalloc_radix_tree It's effectively a different _implementation_ of the current address_in_range(), one that doesn't ever need to read possibly uninitialized memory, and couldn't care less about the OS page size. For the latter reason, it's by far the clearest way to enable expanding pool size above 4 KiB. My PR also eliminates the pool size limitation: https://github.com/python/cpython/pull/13934 but at the cost of breaking bigger pools up internally into 4K regions so the excruciating current address_in_range black magic still works. Neil and I are both keen _mostly_ to increase pool and arena sizes. The bigger they are, the more time obmalloc can spend in its fastest code paths. A question we can't answer yet (or possibly ever) is how badly that would hurt Python returning arenas to the system, in long-running apps that go through phases of low and high memory need. I don't run anything like that - not a world I've ever lived in. All my experiments so far say, for programs that are neither horrible nor wonderful in this respect:
1. An arena size of 4 KiB is most effective for that.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.
So the current 256 KiB arenas already suck for such programs. For "horrible" programs, not even tiny 4K arenas help much. For "wonderful" programs, not even 16 MiB arenas hurt arena recycling effectiveness. So if you have real programs keen to "return memory to the system" periodically, it would be terrific to get info about how changing arena size affects their behavior in that respect. My PR uses 16K pools and 1M arenas, quadrupling the status quo. Because "why not?" ;-) Neil's branch has _generally_, but not always, used 16 MiB arenas. The larger the arenas in his branch, the smaller the radix tree needs to grow.
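As a rough picture of what a radix-tree address_in_range() looks like, here is a small, self-contained C sketch for a 64-bit box; the two-level split, coarse 1 MiB granularity, and function names are illustrative choices, not the layout used in Neil's branch:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>

#define ARENA_BITS 20                          /* pretend arenas are 1 MiB */
#define L2_BITS 12                             /* covers address bits 31..20 */
#define L1_BITS 16                             /* covers address bits 47..32 */

typedef struct { unsigned char used[1 << L2_BITS]; } node_t;
static node_t *root[1 << L1_BITS];             /* second level built lazily */

static void mark_chunk(uintptr_t a)            /* mark the 1 MiB chunk holding a */
{
    size_t i1 = (a >> (ARENA_BITS + L2_BITS)) & ((1u << L1_BITS) - 1);
    size_t i2 = (a >> ARENA_BITS) & ((1u << L2_BITS) - 1);
    if (root[i1] == NULL)
        root[i1] = calloc(1, sizeof(node_t));
    root[i1]->used[i2] = 1;
}

static void note_arena(const void *base)       /* called when an arena is created;
                                                  the real branch is more careful
                                                  at arena boundaries */
{
    mark_chunk((uintptr_t)base);
    mark_chunk((uintptr_t)base + (1 << ARENA_BITS) - 1);
}

static bool address_in_range(const void *p)    /* the whole runtime check */
{
    uintptr_t a = (uintptr_t)p;
    size_t i1 = (a >> (ARENA_BITS + L2_BITS)) & ((1u << L1_BITS) - 1);
    size_t i2 = (a >> ARENA_BITS) & ((1u << L2_BITS) - 1);
    return root[i1] != NULL && root[i1]->used[i2];
}

int main(void)
{
    void *arena = malloc(1 << ARENA_BITS);
    int stack_probe;
    note_arena(arena);
    printf("%d %d\n", address_in_range((char *)arena + 12345),
           address_in_range(&stack_probe));    /* expect "1 0" */
    free(arena);
    return 0;
}
```

The lookup never reads memory it doesn't own, and it doesn't care what the OS page size or pool size is - which is exactly why it decouples pool size from the 4 KiB limit.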

On 2019-06-21, Tim Peters wrote:
If you can test vs some real-world programs, that would be great. I was trying to run some tests this afternoon. Testing with Python 3.8+ is a pain because of the PyCode_New and tp_print changes. I've just added two fixes to the head of the obmalloc_radix_tree branch so that you can compile code generated by old versions of Cython. Without those fixes, building 3rd party extensions can be a real pain.
Currently I have it like your big pool branch (16 KB, 1MB).

For those who would like to test with something compatible with Python 3.7.3, I made re-based branches here: https://github.com/nascheme/cpython/tree/obmalloc_radix_v37 https://github.com/nascheme/cpython/tree/obmalloc_big_pools_v37 They should be ABI compatible with Python 3.7.3. So, if you just re-build the "python" executable, you don't have to rebuild anything else. Both those use the same arena/pool sizes and they both have Tim's arena thrashing fix.

On Fri, 21 Jun 2019 17:18:18 -0500 Tim Peters <tim.peters@gmail.com> wrote:
There's a fundamental problem here: you can't be sure that all allocators reserve such space. If some allocator doesn't, it can return a pointer just at the very start of the page, and if obmalloc tries to look at "a few bits before" that address, it could very well page-fault.
One open question is the cache efficiency of the two approaches. Intuitively, address_in_range() will often look at exactly the same cache line (since a significant number of allocations will share the same "page prefix"). Apparently, this benefit may be offset by cache aliasing issues. Cache aliasing can also be mitigated by the fact that CPU caches are most of the time N-way with N > 1 (but N generally remains small, from 2 to 8, for L1 and L2 caches). So I guess the a priori answer is "it's complicated" :-) I must also thank both you and Neil for running these experiments, even though sometimes I may disagree about the conclusions :-) Regards Antoine.

[Thomas]
And what would be an efficient way of detecting allocations punted to malloc, if not address_in_range?
[Tim]
[Antoine]
I snipped some technical but crucial context in my reply to Thomas: this was assuming users are following the documented requirement to never mix memory calls from different families. What you describe certainly could happen in "illegal" code that, e.g., obtained a block from the system malloc, but passed it to obmalloc to free. Which, in reality, works fine today, although the docs forbid it. (I'm not sure, but there _may_ be some mode of running today that actually enforces the doc requirements.) In the other world, where code is playing by the rules, if an obmalloc malloc/realloc spelling is called, and it needs to punt to a different allocator, no problem: first it boosts the size request so it has room to store "the bit" it needs before the address it actually returns to the client. Then it's "legal" only to free that memory with an obmalloc spelling of free() later - obmalloc reads "the bit", sees "oh - that's not my memory!", and adjusts the pointer accordingly on _its_ call to the spelling of free() corresponding to the memory family obmalloc used to get the memory to begin with.
Neil Schemenauer takes a different approach in the recent "radix tree arena map for obmalloc" thread here. ...
I believe we haven't seen a program yet that used more than one node at the tree's top level :-) But who knows? mmap() and VirtualAlloc() don't appear to make any _guarantees_ that the high-order bits of returned addresses aren't entirely random. In real life so far, they always appear to be zeroes. While x86 has a 64-bit virtual address space, current hardware "only" supports 48-bit virtual addresses, and I haven't seen a virtual address yet where any of the top 16 bits are set. AMD requires that the top 16 bits of virtual addresses be copies of bit 47. Blah blah blah - for the foreseeable future, the top level of the tree has a very easy job. And Neil keenly observed that the top level of the tree can be _declared_ as being very broad (and so suck up a lot of the leading bits), because it's a file static and is effectively just an address space reservation (at least on Linux) until nodes in it actually get used.
Indeed it is.
I must also thank both you and Neil for running these experiments, even though sometimes I may disagree about the conclusions :-)
Well, there aren't any conclusions yet - just seeing the same things repeatedly. Over the weekend, Neil ran many variations of a longish-running "real job" related to his work, which goes through phases of processing bulk database operations and "trying to" release the space (details are complicated). Arena recycling was essentially non-existent in either branch (my PR or the radix tree). In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but on closer examination they were virtually all of the "useless arena thrashing" flavor. The arenas in use were almost always within one of the highwater mark. But it got much better in the branches if arenas were shrunk to a tiny 8 KiB. Which is just another instance of the "256 KiB arenas are already way too fat for effective arena recycling unless the program is exceptionally friendly in its dynamic memory use patterns" observation. We haven't seen a case yet where 1 MiB arenas do materially worse than 256 KiB ones in this respect. Speed is generally a wash between the branches, although they consistently appear to be faster (by a little, not a lot) than 3.7.3. The radix tree generally appears to be a little more memory-frugal than my PR (presumably because my need to break "big pools" into 4K chunks, while the tree branch doesn't, buys the tree more space to actually store objects than it costs for the new tree). We need more "real jobs", and especially ones where arena recycling is actually doing good in 3.7.3 (which can be hard to tell, because the "number of arenas recycled" output in 3.7.3 is vastly inflated by arena-thrashing noise).

[Tim]
It depends a whole lot on the size classes of the most popular objects. A program below computes it all. For a 64-bit box using 3.8 alignment, and 16 KiB pools:

    pages per pool 4
    pool size 16,384
    alignment 16

The first output block:

    size 16
        SQ 1012  1.2%
        PR 1018  0.6%
        RT 1021  0.3%

SQ is the status quo: we have to use four separate 4 KiB pools, and each burns 48 bytes for a pool header. PR: my PR. There's one pool spanning 4 pages, with 48 bytes for a pool header in the first page, and 16 bytes to store the arena index in each of the other 3 pages. RT: the radix tree. One 16 KiB block that only "wastes" 48 bytes for the pool header. The first number on each line is the number of objects we can get from a "big pool" under that scheme. The second number is the % of total pool space that cannot be used for client objects. So, in the above, PR is a substantial improvement over SQ, and RT a less substantial improvement over PR. Regardless of size class, PR never does worse than SQ, and RT never worse than PR. The most dramatic difference is in the largest size class:

    size 512
        SQ   28 12.5%
        PR   28 12.5%
        RT   31  3.1%

RT is a huge win there. And while it's generally true that RT's advantages are more striking in the larger size classes, it doesn't always crush. For example, in the 2nd-largest size class, it doesn't matter at all which scheme is used:

    size 496
        SQ   32  3.1%
        PR   32  3.1%
        RT   32  3.1%

However, in the 3rd-largest size class, RT crushes it again:

    size 480
        SQ   32  6.2%
        PR   32  6.2%
        RT   34  0.4%

I trust the general principle at work here is too obvious to need explanation ;-) Anyway, these differences can really add up when there are millions of objects passed out from a size class where RT has an advantage. That's very attractive to me. On the other hand, this _effectively_ makes arenas even larger (they can contain more objects), which apparently makes it even less likely that arenas can eventually be returned to the system. But, on the third hand, I've seen no evidence yet that increasing arena size matters much at all to that after hitting 128 KiB arenas (smaller than what we already use). "Uncooperative" programs essentially recycle nothing regardless, and "happy" programs essentially recycle almost as many arena bytes with 1 MiB arenas as with 8 KiB arenas. Here's the code:

    PAGES_PER_POOL = 4
    ALIGNMENT = 16  # change to 8 for < Python 3.8
    PAGE_SIZE = 2**12
    POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
    POOL_HEADER_SIZE = 48

    def from_block(size, blocksize, overhead):
        return (blocksize - overhead) // size

    def from_first_page(size, *, pagesize=PAGE_SIZE):
        return from_block(size, pagesize, POOL_HEADER_SIZE)

    # using multiple 4K one-page pools - status quo
    def nobj_4K(size):
        return from_first_page(size) * PAGES_PER_POOL

    # using the PR
    def nobj_PR(size):
        return (from_first_page(size)
                + from_block(size, PAGE_SIZE, ALIGNMENT) * (PAGES_PER_POOL - 1))

    # using the radix tree branch
    def nobj_RT(size):
        return from_first_page(size, pagesize=POOL_SIZE)

    print("pages per pool", PAGES_PER_POOL)
    print(f"pool size {POOL_SIZE:,}")
    print("alignment", ALIGNMENT)
    for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
        print("size", size)
        for tag, f in (("SQ", nobj_4K),
                       ("PR", nobj_PR),
                       ("RT", nobj_RT),
                       ):
            nobj = f(size)
            waste = POOL_SIZE - nobj * size
            print(f"    {tag} {nobj:4} {waste/POOL_SIZE:5.1%}")

For the record, there's another contender in the allocator competition now: https://github.com/microsoft/mimalloc/ Regards Antoine. On Mon, 24 Jun 2019 00:20:03 -0500 Tim Peters <tim.peters@gmail.com> wrote:

[Antoine Pitrou <solipsis@pitrou.net>]
Thanks! From a quick skim, most of it is addressing things obmalloc doesn't:

1) Efficient thread safety (we rely on the GIL).
2) Directly handling requests of all sizes (we only deal with <= 512 bytes).
3) Funky stuff to help refcounting languages that want to guarantee that freeing an object takes bounded time (the problem is that large objects can contain an unbounded number of other objects that will then also die - so mimalloc has complications to interact with client-supplied functions that parcel out the tree of objects killed by a decref, so the freeing can be done incrementally over time).

I don't think it's been explicitly pointed out before, but #2 is an alternative to my PR and Neil's radix tree. If we insist that programs play by the rules, and obmalloc controls every piece of memory it hands out, then the new definition of address_in_range() becomes #define address_in_range(p, pool) true ;-) This deserves more thought than I've ever given to it.

Leaving aside all the above, they emphasize this: """ Many allocators use a single free list per size class which can lead to bad spatial locality where objects belonging to a single structure can be spread out over the entire heap. ... To improve the spatial locality of allocation, mimalloc use free list sharding where the heap is divided into pages (per size-class) with a free list per page (where pages are usually 64KiB for small objects). """ Which is exactly what obmalloc already does, except we call their "pages" "pools", and ours are 16x smaller (<= Python 3.8) or 4x smaller (my PR). They call our "arenas" "segments", and theirs are a minimum of 4 MiB. Their motivations for making these "large" are the same as mine: maximize the time they can stay in the very zippy "fast paths".

Something I'm torn on: within a pool, obmalloc has two ways to find the next free block. The pool has its own block free list (as in mimalloc), but _also_ has a "bump allocator": the pool is carved into blocks one at a time, low address to high, whenever the pool's free list is NULL. This was part of Vladimir's desire to never touch a piece of memory before it's actually needed to satisfy a client request. But it does complicate and slow the code on the second-fastest allocation path (when the pool's free list _is_ NULL). The conditional "OK, the free list is NULL - so is there room for another bump allocation?" is a waste of time after all the pool's blocks have been passed out at least once. The bump allocator can never succeed again then, but we keep testing for it forever.

The alternative is to break the pool into blocks, and link them into a free list, in one gulp when a size class is assigned to a free pool. That mutates all the blocks in the pool (to store the list pointers), even if all but one will never be used. In return, the code for testing whether the bump allocator can find room, and the pool header members supporting the bump allocator, could be thrown out. That's what mimalloc appears to do, and they said it sped things up a bit overall. But I think I'll leave that one for someone else ;-)
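To illustrate the two block-finding strategies being compared, here is a stripped-down, self-contained C sketch; the struct layout, names, and sizes are invented for illustration and omit everything else a real pool header carries:

```c
#include <stdio.h>
#include <string.h>

#define POOL_SIZE 4096
#define BLOCK_SIZE 32

typedef struct pool {
    unsigned char mem[POOL_SIZE];
    void *freeblock;            /* singly linked list of freed blocks */
    unsigned int nextoffset;    /* next never-used byte to carve ("bump") */
} pool_t;

static void *pool_alloc(pool_t *p)
{
    if (p->freeblock != NULL) {                     /* fastest path */
        void *bp = p->freeblock;
        memcpy(&p->freeblock, bp, sizeof(void *));  /* pop the head */
        return bp;
    }
    if (p->nextoffset + BLOCK_SIZE <= POOL_SIZE) {  /* bump allocator */
        void *bp = p->mem + p->nextoffset;
        p->nextoffset += BLOCK_SIZE;
        return bp;
    }
    return NULL;                                    /* pool is full */
}

static void pool_free(pool_t *p, void *bp)
{
    memcpy(bp, &p->freeblock, sizeof(void *));      /* push on free list */
    p->freeblock = bp;
}

int main(void)
{
    pool_t pool = { .freeblock = NULL, .nextoffset = 0 };
    void *a = pool_alloc(&pool), *b = pool_alloc(&pool);
    pool_free(&pool, a);
    printf("%d\n", pool_alloc(&pool) == a);  /* 1: freed block reused first */
    (void)b;
    return 0;
}
```

The second branch is the one described above as pure overhead once every block in the pool has been handed out at least once; the eager alternative would pre-link mem into a free list when the pool is initialized and drop nextoffset entirely.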

On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
It's a very strong competitor! $ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3 Faster (14): - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%) - unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%) - json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%) - json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%) - crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%) - regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%) - pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%) - scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%) - float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%) - pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%) - mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%) - scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms: 1.04x faster (-3%) - xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%) - sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x faster (-3%) Benchmark hidden because not significant (46): ...

On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
Ah, interesting. Were you able to measure the memory footprint as well?
Hmm, it is not good. mimalloc uses MADV_FREE so it may affect to some benchmarks. I will look it later. ``` $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G Slower (60): - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%) - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%) - regex_dna: 13.3 MB +- 19.1 kB -> 27.0 MB +- 12.1 kB: 2.02x slower (+102%) - json_dumps: 8351.8 kB +- 19.8 kB -> 15.2 MB +- 18.0 kB: 1.87x slower (+87%) - regex_v8: 8434.6 kB +- 20.9 kB -> 14.4 MB +- 11.0 kB: 1.75x slower (+75%) - unpack_sequence: 7521.0 kB +- 17.0 kB -> 9980.8 kB +- 24.7 kB: 1.33x slower (+33%) - hexiom: 7412.2 kB +- 19.0 kB -> 9307.4 kB +- 8004 bytes: 1.26x slower (+26%) - xml_etree_process: 12.2 MB +- 26.3 kB -> 15.0 MB +- 28.9 kB: 1.23x slower (+23%) - genshi_text: 10.2 MB +- 24.0 kB -> 12.5 MB +- 24.8 kB: 1.22x slower (+22%) - crypto_pyaes: 7602.2 kB +- 35.7 kB -> 9242.8 kB +- 7873 bytes: 1.22x slower (+22%) - tornado_http: 24.9 MB +- 72.1 kB -> 30.1 MB +- 33.0 kB: 1.21x slower (+21%) - chameleon: 15.8 MB +- 24.5 kB -> 19.1 MB +- 23.4 kB: 1.21x slower (+21%) - genshi_xml: 10.9 MB +- 24.0 kB -> 12.9 MB +- 19.6 kB: 1.18x slower (+18%) - go: 8662.6 kB +- 16.4 kB -> 10082.8 kB +- 26.2 kB: 1.16x slower (+16%) - pathlib: 8863.6 kB +- 30.2 kB -> 10229.8 kB +- 19.4 kB: 1.15x slower (+15%) - scimark_fft: 7473.4 kB +- 14.4 kB -> 8606.0 kB +- 28.6 kB: 1.15x slower (+15%) - scimark_lu: 7463.2 kB +- 15.1 kB -> 8569.8 kB +- 28.6 kB: 1.15x slower (+15%) - pidigits: 7380.2 kB +- 20.0 kB -> 8436.0 kB +- 24.2 kB: 1.14x slower (+14%) - scimark_monte_carlo: 7354.4 kB +- 18.2 kB -> 8398.8 kB +- 27.0 kB: 1.14x slower (+14%) - scimark_sparse_mat_mult: 7889.8 kB +- 20.1 kB -> 9006.2 kB +- 29.4 kB: 1.14x slower (+14%) - scimark_sor: 7377.2 kB +- 18.9 kB -> 8402.0 kB +- 29.0 kB: 1.14x slower (+14%) - chaos: 7693.0 kB +- 33.0 kB -> 8747.6 kB +- 10.5 kB: 1.14x slower (+14%) - richards: 7364.2 kB +- 29.8 kB -> 8331.4 kB +- 20.2 kB: 1.13x slower (+13%) - raytrace: 7696.0 kB +- 30.3 kB -> 8695.4 kB +- 30.0 kB: 1.13x slower (+13%) - sqlite_synth: 8799.2 kB +- 25.5 kB -> 9937.4 kB +- 27.1 kB: 1.13x slower (+13%) - logging_silent: 7533.8 kB +- 32.0 kB -> 8488.2 kB +- 25.1 kB: 1.13x slower (+13%) - json_loads: 7317.8 kB +- 22.7 kB -> 8215.2 kB +- 21.5 kB: 1.12x slower (+12%) - unpickle_list: 7513.4 kB +- 9790 bytes -> 8420.6 kB +- 25.6 kB: 1.12x slower (+12%) - unpickle: 7519.8 kB +- 11.4 kB -> 8425.4 kB +- 27.1 kB: 1.12x slower (+12%) - fannkuch: 7170.0 kB +- 14.9 kB -> 8033.0 kB +- 22.5 kB: 1.12x slower (+12%) - pickle_list: 7514.6 kB +- 18.2 kB -> 8414.6 kB +- 24.0 kB: 1.12x slower (+12%) - telco: 7685.2 kB +- 15.0 kB -> 8598.2 kB +- 17.6 kB: 1.12x slower (+12%) - nbody: 7214.8 kB +- 10.7 kB -> 8070.2 kB +- 19.5 kB: 1.12x slower (+12%) - pickle: 7523.2 kB +- 12.4 kB -> 8415.0 kB +- 21.0 kB: 1.12x slower (+12%) - 2to3: 7171.2 kB +- 35.8 kB -> 8016.4 kB +- 21.7 kB: 1.12x slower (+12%) - nqueens: 7425.2 kB +- 21.8 kB -> 8296.8 kB +- 25.5 kB: 1.12x slower (+12%) - spectral_norm: 7212.6 kB +- 19.6 kB -> 8052.8 kB +- 18.4 kB: 1.12x slower (+12%) - regex_compile: 8538.0 kB +- 21.0 kB -> 9528.6 kB +- 22.1 kB: 1.12x slower (+12%) - pickle_pure_python: 7559.8 kB +- 19.4 kB -> 8430.0 kB +- 25.6 kB: 1.12x slower (+12%) - unpickle_pure_python: 7545.4 kB +- 9233 bytes -> 8413.0 kB +- 15.6 kB: 1.11x slower (+11%) - float: 23.9 MB +- 22.5 kB -> 26.6 MB +- 19.7 kB: 1.11x slower (+11%) - sqlalchemy_imperative: 18.2 MB +- 46.2 kB -> 20.2 MB +- 36.5 kB: 1.11x slower 
(+11%) - regex_effbot: 7910.8 kB +- 15.1 kB -> 8804.8 kB +- 20.9 kB: 1.11x slower (+11%) - pickle_dict: 7563.4 kB +- 15.3 kB -> 8415.2 kB +- 19.3 kB: 1.11x slower (+11%) - sqlalchemy_declarative: 18.9 MB +- 40.2 kB -> 21.0 MB +- 26.4 kB: 1.11x slower (+11%) - xml_etree_parse: 11.8 MB +- 12.6 kB -> 13.0 MB +- 16.3 kB: 1.11x slower (+11%) - html5lib: 20.1 MB +- 44.9 kB -> 22.2 MB +- 46.6 kB: 1.10x slower (+10%) - xml_etree_iterparse: 12.0 MB +- 26.5 kB -> 13.2 MB +- 31.3 kB: 1.10x slower (+10%) - sympy_integrate: 36.4 MB +- 26.7 kB -> 40.0 MB +- 33.2 kB: 1.10x slower (+10%) - sympy_str: 37.2 MB +- 28.4 kB -> 40.7 MB +- 26.6 kB: 1.10x slower (+10%) - sympy_expand: 36.2 MB +- 19.9 kB -> 39.7 MB +- 25.1 kB: 1.09x slower (+9%) - mako: 15.3 MB +- 19.1 kB -> 16.7 MB +- 25.4 kB: 1.09x slower (+9%) - django_template: 19.3 MB +- 14.9 kB -> 21.0 MB +- 14.6 kB: 1.09x slower (+9%) - xml_etree_generate: 12.3 MB +- 39.5 kB -> 13.3 MB +- 26.9 kB: 1.08x slower (+8%) - deltablue: 8918.0 kB +- 19.8 kB -> 9615.8 kB +- 12.5 kB: 1.08x slower (+8%) - dulwich_log: 12.2 MB +- 102.9 kB -> 13.1 MB +- 26.6 kB: 1.08x slower (+8%) - meteor_contest: 9296.0 kB +- 11.9 kB -> 9996.8 kB +- 20.7 kB: 1.08x slower (+8%) - sympy_sum: 62.2 MB +- 20.8 kB -> 66.5 MB +- 21.1 kB: 1.07x slower (+7%) - python_startup: 7946.6 kB +- 20.4 kB -> 8210.2 kB +- 16.6 kB: 1.03x slower (+3%) - python_startup_no_site: 7409.0 kB +- 18.3 kB -> 7574.6 kB +- 21.8 kB: 1.02x slower (+2%) ``` -- Inada Naoki <songofacandy@gmail.com>

[Antoine Pitrou <solipsis@pitrou.net>]
Ah, interesting. Were you able to measure the memory footprint as well?
[Inada Naoki <songofacandy@gmail.com>]
Could you say more about what's being measured here? Like, if this is on Linux, is it reporting RSS? VSS? Something else? mimalloc is "most like" obmalloc among those I've looked at in recent weeks. I noted before that its pools (they call them "pages") and arenas (called "segments") are at least 16x larger than obmalloc uses (64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size, in mimalloc). Like all "modern" 64-bit allocators, it cares little about reserving largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom. But it also releases (in some sense, like MADV_FREE) physical RAM back to the system at a granularity far smaller than arena/segment. At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB vector at the start, so no program linking to that can consume less than half a gig of address space. But it only expects to _use_ a few 4 KiB OS pages out of that. mimalloc doesn't do anything anywhere near _that_ gonzo (& since mimalloc comes out of Microsoft, it never will - "total virtual memory" on Windows is a system-wide resource, limited to the sum of physical RAM + pagefile size - no "overcommit" is allowed).

I guess that INADA-san used pyperformance --track-memory. pyperf --track-memory doc: "--track-memory: get the memory peak usage. it is less accurate than tracemalloc, but has a lower overhead. On Linux, compute the sum of Private_Clean and Private_Dirty memory mappings of /proc/self/smaps. On Windows, get PeakPagefileUsage of GetProcessMemoryInfo() (of the current process): the peak value of the Commit Charge during the lifetime of this process." https://pyperf.readthedocs.io/en/latest/runner.html#misc On Linux, pyperf uses read_smap_file() of pyperf._memory: https://github.com/vstinner/pyperf/blob/master/pyperf/_memory.py

    # Code to parse Linux /proc/%d/smaps files.
    #
    # See http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html for
    # a quick introduction to smaps.
    #
    # Need Linux 2.6.16 or newer.
    def read_smap_file():
        total = 0
        fp = open(proc_path("self/smaps"), "rb")
        with fp:
            for line in fp:
                # Include both Private_Clean and Private_Dirty sections.
                line = line.rstrip()
                if line.startswith(b"Private_") and line.endswith(b'kB'):
                    parts = line.split()
                    total += int(parts[1]) * 1024
        return total

It spawns a thread which reads /proc/self/smaps every millisecond and then reports the *maximum*. Victor

On Thu, Jul 4, 2019 at 5:12 PM Tim Peters <tim.peters@gmail.com> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

[Victor Stinner <vstinner@redhat.com>]
So I'll take that as meaning essentially that it's reporting what RSS would report if it ignored shared pages (so peak # of private pages actually backed by physical RAM). Clear as mud how MADV_FREE affects that.

I found that the calibrated loop count is not stable, so memory usage is very different in some benchmarks. In particular, RAM usage of the logging benchmark is closely related to the loop count:

    $ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so ./python bm_logging.py simple --track-memory --fast --inherit-environ PYTHONMALLOC,LD_PRELOAD -v
    Run 1: calibrate the number of loops: 512
    - calibrate 1: 12.7 MB (loops: 512)
    Calibration: 1 warmup, 512 loops
    Run 2: 0 warmups, 1 value, 512 loops
    - value 1: 12.9 MB
    Run 3: 0 warmups, 1 value, 512 loops
    - value 1: 12.9 MB
    ...

    $ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so ./python bm_logging.py simple --track-memory --fast --inherit-environ PYTHONMALLOC,LD_PRELOAD -v -l1024
    Run 1: 0 warmups, 1 value, 1024 loops
    - value 1: 21.4 MB
    Run 2: 0 warmups, 1 value, 1024 loops
    - value 1: 21.4 MB
    Run 3: 0 warmups, 1 value, 1024 loops
    - value 1: 21.4 MB
    ...

-- Inada Naoki <songofacandy@gmail.com>

On Thu, Jul 4, 2019 at 11:32 PM Inada Naoki <songofacandy@gmail.com> wrote:
I think I understand why mimalloc uses more than twice the memory of pymalloc + glibc malloc in the logging_format and logging_simple benchmarks. These two benchmarks do something like this:

    buf = []  # in StringIO
    for _ in range(10*1024):
        buf.append("important: some important information to be logged")
    s = "".join(buf)  # StringIO.getvalue()
    s.splitlines()

mimalloc uses a size-segregated allocator for ~512KiB. And the size class is determined by the top three bits. On the other hand, list increases capacity by 9/8. It means the next size class is used on each realloc. In the end, all size classes have 1~3 used/cached memory blocks. This is almost the worst case for mimalloc. In a more complex application, there may be more chances to reuse memory blocks. In complex or huge applications, this overhead will become relatively small. Its speed is attractive. But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for common cases. Regards, -- Inada Naoki <songofacandy@gmail.com>

[Inada Naoki <songofacandy@gmail.com>, trying mimalloc]
Hmm, it is not good. mimalloc uses MADV_FREE so it may affect to some benchmarks. I will look it later.
Often, but not always (multiplication by 9/8 may not change the top 3 bits - e.g., 128 * 9/8 = 144).
In the end, all size classes have 1~3 used/cached memory blocks.
No doubt part of it, but hard to believe it's most of it. If the loop count above really is 10240, then there's only about 80K worth of pointers in the final `buf`. To account for a difference of over 10M, it would need to have left behind well over 100 _full_ size copies from earlier reallocs. In fact the number of list elements across resizes goes like so: 0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945 Adding all of those sums to 96,113, so that accounts for less than 1M of 8-byte pointers if none were ever released. mimalloc will, of course, add its own slop on top of that - but not a factor of ten's worth. Unless maybe it's using a dozen such buffers at once? But does it really matter? ;-) mimalloc "should have" done MADV_FREE on the pages holding the older `buf` instances, so it's not like the app is demanding to hold on to the RAM (albeit that it may well show up in the app's RSS unless/until the OS takes the RAM away).
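The resize arithmetic above is easy to reproduce. A small C sketch, assuming CPython 3.7's list over-allocation rule new_allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6):

```c
#include <stdio.h>

int main(void)
{
    long capacity = 0, total = 0;
    for (long n = 1; n <= 10 * 1024; n++) {   /* the 10240 appends */
        if (n > capacity) {                   /* list has to grow */
            capacity = n + (n >> 3) + (n < 9 ? 3 : 6);
            total += capacity;
            printf("%ld ", capacity);
        }
    }
    /* prints 4 8 16 25 35 46 ... 7671 8637 9723 10945, total 96113 */
    printf("\ntotal slots across all resizes: %ld\n", total);
    return 0;
}
```

96,113 slots is under 1 MB of 8-byte pointers even if every earlier copy were still alive, which is the point: the list's own resizing can't explain a 10M+ gap.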
The mimalloc page says that, in their benchmarks: """ In our benchmarks (see below), mimalloc always outperforms all other leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses less memory (up to 25% more in the worst case). """ obmalloc is certainly more "memory efficient" (for some meanings of that phrase) for smaller objects: in 3.7 it splits objects of <= 512 bytes into 64 size classes. mimalloc also has (close to) 64 "not gigantic" size classes, but those cover a range of sizes over a thousand times wider (up to about half a meg). Everything obmalloc handles fits in mimalloc's first 20 size classes. So mimalloc routinely needs more memory to satisfy a "small object" request than obmalloc does. I was more intrigued by your first (speed) comparison:
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
Now _that's_ interesting ;-) Looks like spectral_norm recycles many short-lived Python floats at a swift pace. So memory management should account for a large part of its runtime (the arithmetic it does is cheap in comparison), and obmalloc and mimalloc should both excel at recycling mountains of small objects. Why is mimalloc significantly faster? This benchmark should stay in the "fastest paths" of both allocators most often, and they both have very lean fastest paths (they both use pool-local singly-linked sized-segregated free lists, so malloc and free for both should usually amount to just popping or pushing one block off/on the head of the appropriate list). obmalloc's `address_in_range()` is definitely a major overhead in its fastest `free()` path, but then mimalloc has to figure out which thread is doing the freeing (looks cheaper than address_in_range, but not free). Perhaps the layers of indirection that have been wrapped around obmalloc over the years are to blame? Perhaps mimalloc's larger (16x) pools and arenas let it stay in its fastest paths more often? I don't know why, but it would be interesting to find out :-)

On Tue, Jul 9, 2019 at 9:46 AM Tim Peters <tim.peters@gmail.com> wrote:
You are right. List.append is not the major memory consumer of the "large" class (8KiB+1 ~ 512KiB). There are several causes of large-size allocs:

* bm_logging uses StringIO.seek(0); StringIO.truncate() to reset the buffer. So the internal buffer of StringIO becomes a Py_UCS4 array instead of a list of strings from the 2nd loop. This buffer uses the same policy as list for increasing capacity: `size + (size >> 3) + (size < 9 ? 3 : 6)`. Actually, when I use the `-n 1` option, memory usage is only 9MiB.
* The intern dict.
* Many modules are loaded, and FileIO.readall() is used to read pyc files. This creates and deletes various sizes of bytes objects.
* logging module uses several regular expressions. `b'\0' * 0xff00` is used in sre_compile. https://github.com/python/cpython/blob/master/Lib/sre_compile.py#L320
mimalloc doesn't call madvise for each free(). Each size class keeps a 64KiB "page". And several pages (4KiB) in the "page" are committed but not used. I dumped all "mimalloc page" stats. https://paper.dropbox.com/doc/mimalloc-on-CPython--Agg3g6XhoX77KLLmN43V48cfAg-fFyIm8P9aJpymKQN0scpp#:uid=671467140288877659659079&h2=memory-usage-of-logging_format For example:

    bin block_size used capacity reserved
     29     2560     1    22      25    (14 pages are committed, 2560 bytes are in use)
     29     2560    14    25      25    (16 pages are committed, 2560*14 bytes are in use)
     29     2560    11    25      25
     31     3584     1     5      18    (5 pages are committed, 3584 bytes are in use)
     33     5120     1     4      12
     33     5120     2    12      12
     33     5120     2    12      12
     37    10240     3    11     409
     41    20480     1     6     204
     57   327680     1     2      12

* committed pages can be calculated roughly by `ceil(block_size * capacity / 4096)`.

There are dozens of unused memory blocks and committed pages in each size class. This caused 10MiB+ memory usage overhead on the logging_format and logging_simple benchmarks.
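As a worked example of that rule of thumb, here is a tiny C sketch applying ceil(block_size * capacity / 4096) to a few of the rows above; the rows are copied from the dump, everything else is illustrative:

```c
#include <stdio.h>

int main(void)
{
    /* block_size, used, capacity - copied from a few rows of the dump */
    long rows[][3] = {
        {2560, 1, 22}, {2560, 14, 25}, {3584, 1, 5}, {10240, 3, 11},
    };
    long n = sizeof(rows) / sizeof(rows[0]);
    for (long i = 0; i < n; i++) {
        long block = rows[i][0], used = rows[i][1], cap = rows[i][2];
        long committed = (block * cap + 4095) / 4096;   /* ceil division */
        printf("block %6ld: ~%3ld committed 4K pages, %7ld bytes in use\n",
               block, committed, block * used);
    }
    return 0;
}
```

For the first row that gives 14 committed pages against 2560 bytes actually in use, matching the parenthetical note in the dump - which is where the overhead comes from.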
Totally agree. I'll investigate this next. Regards, -- Inada Naoki <songofacandy@gmail.com>

On Tue, Jul 9, 2019 at 5:29 PM Inada Naoki <songofacandy@gmail.com> wrote:
I compared "perf" output of mimalloc and pymalloc, and I succeeded in optimizing pymalloc!

    $ ./python bm_spectral_norm.py --compare-to ./python-master
    python-master: ..................... 199 ms +- 1 ms
    python: ..................... 182 ms +- 4 ms
    Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +- 4 ms: 1.10x faster (-9%)

mimalloc uses many small static (inline) functions. On the other hand, pymalloc_alloc and pymalloc_free are large functions containing slow/rare paths. PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free. But the compiler doesn't know which parts of pymalloc_alloc and pymalloc_free are hot. So gcc failed to choose the right code to inline. The remaining parts of pymalloc_alloc and pymalloc_free are called, and many push/pop instructions are executed, because they contain complex logic. So I tried to use the LIKELY/UNLIKELY macros to teach the compiler the hot parts. But I need to use "static inline" for pymalloc_alloc and pymalloc_free yet [1]. The generated assembly is optimized well; the hot code is at the top of PyObject_Malloc [2] and PyObject_Free [3]. But there is a lot of code duplication in PyObject_Malloc and PyObject_Calloc, etc... [1] https://github.com/python/cpython/pull/14674/files [2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmall... [3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmall... I will try to split pymalloc_alloc and pymalloc_free into smaller functions. Except for the above, there is one more important difference: pymalloc returns a free pool to the freepool list as soon as the pool becomes empty. On the other hand, mimalloc doesn't return a "page" (similar to a "pool" in pymalloc) every time it becomes empty [4]. So it can avoid rebuilding the linked list of free blocks. I think pymalloc should do the same optimization. [4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e... BTW, which is the proper name: pymalloc, or obmalloc? Regards, -- Inada Naoki <songofacandy@gmail.com>
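For readers unfamiliar with the technique, here is a tiny, self-contained C sketch of the branch-hint plus hot/cold-split idea; the macro definitions are the usual GCC/clang idiom, but the allocator bits (free_list, refill_and_alloc) are invented stand-ins, not the PR's actual code:

```c
#include <stdio.h>
#include <stdlib.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

static void *free_list = NULL;      /* stand-in for a pool's free list */

/* Rare path: kept out of line so it doesn't bloat callers of fast_alloc(). */
__attribute__((noinline)) static void *refill_and_alloc(void)
{
    /* pretend to carve a new pool; a real allocator does far more here */
    return malloc(32);
}

static inline void *fast_alloc(void)
{
    if (UNLIKELY(free_list == NULL))
        return refill_and_alloc();  /* cold: compiler moves this aside */
    void *bp = free_list;           /* hot: pop the head of the free list */
    free_list = *(void **)bp;
    return bp;
}

int main(void)
{
    void *p = fast_alloc();
    printf("%p\n", p);
    free(p);
    return 0;
}
```

With the hint and the out-of-line cold path, the inlined fast path reduces to a test and a couple of memory operations, which is roughly the shape reported at the top of PyObject_Malloc after the change.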

[Inada Naoki <songofacandy@gmail.com>, looking into why mimalloc did so much better on spectral_norm]
Yay!! Nice :-)
mimalloc uses many small static (inline) functions.
Too many, if you ask me ;-) Really, the enormous number of tiny functions makes the code very hard to follow for me. OTOH, the tiny number of enormous functions in pymalloc makes that hard to follow too :-(
On the other hand, pymalloc_alloc and pymalloc_free are large functions containing slow/rare paths.
obmalloc.c was written when compiler inlining was barely a thing. Some restructuring is a fine idea - overdue, in fact.
Yes, splitting out the slower paths should be a win. Note this is one reason I remain keen to increase pool size: the bigger the pools, the more objects can be handed out & reclaimed _from_ the fastest paths. You've now discovered that "the fastest paths" are, for pragmatic compiler-is-overwhelmed reasons, significantly slower than they "should be".
Except for the above, there is one more important difference.
pymalloc returns a free pool to the freepool list as soon as the pool becomes empty.
Fine by me, & I think it should continue to do so. Details below.
pymalloc never returns anything to the system at smaller than "arena" granularity, so it's not a big deal at all _merely_ to link and unlink a pool to/from an arena's freepool list. There's no possibility that any byte in the pool can change while the pool is sitting on a freepools list. If we freed the last pool of a given size class, and next need to allocate another object of that size class (most common), it's cheap: when the pool is unlinked from the pool free list, it sees that the pool's last size class is the same as the size class being requested, and skips almost all pool initialization (the pool is already fully prepared for objects of this size class).

    init_pool:
        /* Frontlink to used pools. */
        next = usedpools[size + size]; /* == prev */
        pool->nextpool = next;
        pool->prevpool = next;
        next->nextpool = pool;
        next->prevpool = pool;
        pool->nalloc = 1;
        if (pool->szidx == size) {  // xxxxxxxx HERE xxxxxxxx
            /* Luckily, this pool last contained blocks
             * of the same size class, so its header
             * and free list are already initialized.
             */
            bp = pool->freeblock;
            assert(bp != NULL);
            pool->freeblock = *(block **)bp;
            goto success;  // xxxxxxxxxxx FASTEST PATH xxxxxxxxxxxxxx
        }
        // and here there's a mound of code in case
        // the size classes don't match, including code
        // to restart the process of linking the pool's
        // blocks into a free list.

So in the common case, the pool's free list is reused exactly as-is at the time the pool was linked to the freepool list.
I think pymalloc should do the same optimization.
As above, I believe pymalloc is already optimal in this case. On the other end, something to consider: when a block is freed from an entirely used pool, the pool now contains one free block, and is linked to the front of usedpools[sizeclass]. So the next request for a block of that size will come from that pool, which will again cause the pool to become entirely used. That's a kind of possible thrashing, and acts against mimalloc's (sane, to my eyes) desire to _entirely_ use up a given pool ("page") once it starts passing things out from a pool. It could well be better to, e.g., link the pool-with-one-newly-free-block to the end of the usedpool list instead. But it's unclear. The reason it front-links is that the block being freed is presumably in L1 cache (because, e.g., its refcount just fell to 0), so best to hand it out again ASAP. But that old reasoning applied when pymalloc was _only_ used for PyObject-like structs.
[4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e...
BTW, which is proper name? pymalloc, or obmalloc.
I think it's pymalloc, so that's what I used above - I got lazy before ;-)

On 2019-07-09, Inada Naoki wrote:
I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO enabled. So, I would try that first. Also, if you use relatively small static functions that are defined before use (no forward declarations), I have found that GCC is usually smart about inlining them. So, I don't think you should have to use "static inline" rather than just "static". Good work looking into this. Should be some relatively easy performance win. Cheers, Neil

[Inada Naoki]
[Neil Schemenauer]
I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO enabled.
I like adding those regardless of whether compilers find them helpful: they help _people_ reading the code focus on what's important to speed. While not generally crucial, speed is important in these very low-level, very heavily used functions. Speaking of which, another possible teensy win: pymalloc's allocation has always started with:

    if (nbytes == 0) {
        return 0;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return 0;
    }
    size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;

But it could be a bit leaner:

    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
        return 0;
    }
    size = (uint)fatsize;

The `nbytes == 0` case ends up mapping to a very large size class then, although C may not guarantee that. But it doesn't matter: if it maps to "a real" size class, that's fine. We'll return a unique pointer into a pymalloc pool then, and "unique pointer" is all that's required. An allocation requesting 0 bytes does happen at times, but it's very rare. It just doesn't merit its own dedicated test-&-branch.
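A quick sanity check of that claim, as a standalone C sketch (the constants assume 16-byte alignment and the 512-byte small-request limit; they're illustrative, not copied from obmalloc.c): with unsigned size_t arithmetic a 0-byte request wraps to an enormous fatsize, so the single limit test still rejects it, just as the old dedicated nbytes == 0 branch did.

```c
#include <stdio.h>
#include <stddef.h>

#define ALIGNMENT_SHIFT 4            /* 16-byte alignment, as in 3.8 */
#define NB_SMALL_SIZE_CLASSES 32     /* 512 / 16 */

int main(void)
{
    size_t requests[] = {0, 1, 16, 512, 513};
    for (int i = 0; i < 5; i++) {
        size_t nbytes = requests[i];
        size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
        printf("nbytes=%3zu  fatsize=%20zu  handled by pymalloc: %s\n",
               nbytes, fatsize,
               fatsize < NB_SMALL_SIZE_CLASSES ? "yes" : "no");
    }
    return 0;
}
```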
Good work looking into this. Should be some relatively easy performance win.
Ditto!

On 2019-07-09, Inada Naoki wrote:
Hello Inada, I don't see this on my PC. I'm using GCC 8.3.0. I have configured the build with --enable-optimizations. To speed up the profile generation, I have changed PROFILE_TASK to only run these tests: test_shelve test_set test_pprint test_pickletools test_ordered_dict test_tabnanny test_difflib test_pickle test_json test_collections I haven't spent much time trying to figure out what set of tests is best but the above set runs pretty quickly and seems to work okay. I have run pyperformance to compare CPython 'master' with your PR 14674. There doesn't seem to be a difference (table below). If I look at the disassembly, it seems that the hot paths of pymalloc_alloc and pymalloc_free are being inlined as you would hope, without needing the LIKELY/UNLIKELY annotations. OTOH, your addition of LIKELY() and UNLIKELY() in the PR is a pretty small change and probably doesn't hurt anything. So, I think it would be fine to merge it. Regards, Neil

+-------------------------+---------+-----------------------------+
| Benchmark               | master  | PR-14674                    |
+=========================+=========+=============================+
| 2to3                    | 305 ms  | 304 ms: 1.00x faster (-0%)  |
| chaos                   | 109 ms  | 110 ms: 1.01x slower (+1%)  |
| crypto_pyaes            | 118 ms  | 117 ms: 1.01x faster (-1%)  |
| django_template         | 112 ms  | 114 ms: 1.02x slower (+2%)  |
| fannkuch                | 446 ms  | 440 ms: 1.01x faster (-1%)  |
| float                   | 119 ms  | 120 ms: 1.01x slower (+1%)  |
| go                      | 247 ms  | 250 ms: 1.01x slower (+1%)  |
| json_loads              | 25.1 us | 24.4 us: 1.03x faster (-3%) |
| logging_simple          | 8.86 us | 8.66 us: 1.02x faster (-2%) |
| meteor_contest          | 97.5 ms | 97.7 ms: 1.00x slower (+0%) |
| nbody                   | 140 ms  | 142 ms: 1.01x slower (+1%)  |
| pathlib                 | 19.2 ms | 18.9 ms: 1.01x faster (-1%) |
| pickle                  | 8.95 us | 9.08 us: 1.02x slower (+2%) |
| pickle_dict             | 18.1 us | 18.0 us: 1.01x faster (-1%) |
| pickle_list             | 2.75 us | 2.68 us: 1.03x faster (-3%) |
| pidigits                | 182 ms  | 184 ms: 1.01x slower (+1%)  |
| python_startup          | 7.83 ms | 7.81 ms: 1.00x faster (-0%) |
| python_startup_no_site  | 5.36 ms | 5.36 ms: 1.00x faster (-0%) |
| raytrace                | 495 ms  | 499 ms: 1.01x slower (+1%)  |
| regex_dna               | 173 ms  | 170 ms: 1.01x faster (-1%)  |
| regex_effbot            | 2.79 ms | 2.67 ms: 1.05x faster (-4%) |
| regex_v8                | 21.1 ms | 21.2 ms: 1.00x slower (+0%) |
| richards                | 68.2 ms | 68.7 ms: 1.01x slower (+1%) |
| scimark_monte_carlo     | 103 ms  | 102 ms: 1.01x faster (-1%)  |
| scimark_sparse_mat_mult | 4.37 ms | 4.35 ms: 1.00x faster (-0%) |
| spectral_norm           | 132 ms  | 133 ms: 1.01x slower (+1%)  |
| sqlalchemy_imperative   | 30.3 ms | 30.7 ms: 1.01x slower (+1%) |
| sympy_sum               | 88.2 ms | 89.2 ms: 1.01x slower (+1%) |
| telco                   | 6.63 ms | 6.58 ms: 1.01x faster (-1%) |
| tornado_http            | 178 ms  | 179 ms: 1.01x slower (+1%)  |
| unpickle                | 12.0 us | 12.4 us: 1.03x slower (+3%) |
| unpickle_list           | 3.93 us | 3.75 us: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+

Not significant (25): deltablue; dulwich_log; hexiom; html5lib; json_dumps; logging_format; logging_silent; mako; nqueens; pickle_pure_python; regex_compile; scimark_fft; scimark_lu; scimark_sor; sqlalchemy_declarative; sqlite_synth; sympy_expand; sympy_integrate; sympy_str; unpack_sequence; unpickle_pure_python; xml_etree_parse; xml_etree_iterparse; xml_etree_generate; xml_etree_process

On Wed, Jul 10, 2019 at 5:18 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
I didn't use PGO and that's why GCC didn't know which parts are hot. Maybe pymalloc performance is similar to mimalloc when PGO is used, but I have not confirmed it. While Linux distributions are using PGO, some people use non-PGO Python (Homebrew, pyenv, etc...). So better performance without PGO is worthwhile. Regards, -- Inada Naoki <songofacandy@gmail.com>

I did it and pymalloc is now as fast as mimalloc.

    $ ./python bm_spectral_norm.py --compare-to=./python-master
    python-master: ..................... 199 ms +- 1 ms
    python: ..................... 176 ms +- 1 ms
    Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 176 ms +- 1 ms: 1.13x faster (-11%)

I filed a new issue for this: https://bugs.python.org/issue37543

On Fri, Jun 21, 2019 at 11:19 PM Thomas Wouters <thomas@python.org> wrote:
Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime?
The memory allocator must not be changed after Python pre-initialization. What's done after pre-initialization is more about installing "hooks" which execute code before/after an allocation, but they don't replace the allocator. It simply doesn't work to switch from pymalloc to malloc "at runtime". Calling PyMem_Free(ptr) would call free(ptr). If the memory block was allocated by pymalloc, free(ptr) simply crashes. Victor

[Inada Naoki <songofacandy@gmail.com>]
Which is 3 lines of code plus a closing brace. The OP didn't build their own Python, and the source from which it was compiled wasn't available to them. But they did say that when they got into gdb and single-stepped, it was looping through the same three lines of code, in obmalloc.c, over & over. Which is 100% consistent with the loop you identified as being "the hot spot".

On Fri, May 24, 2019 at 3:49 AM Gregory P. Smith <greg@krypto.org> wrote:
I suggest filing a bug to track this...
I created the issue: https://bugs.python.org/issue37029 -- Inada Naoki <songofacandy@gmail.com>

It seems pretty clear now that the primary cause is keeping arenas sorted by number of free pools. As deallocation goes on, the number of distinct "# of free pools" values decreases, leaving large numbers of arenas sharing the same value. Then, e.g., if there are 10,000 arenas with 30 free pools remaining, and another pool is freed in one of them, its free pool count jumps to 31, and on average we have to crawl over 5,000 of its former peers to locate its new position in the list. Which is also consistent with what the OP saw, and I did too but in a much smaller case: the longer deallocation goes on, the slower it gets (fewer and fewer Nodes freed per unit time). Which suggests a different - albeit related - approach: it's not really the order of arenas that matters, but the partitioning of arenas by number of free pools. So instead of a single doubly linked list of arenas, have a vector of doubly linked lists, one list per possible number of free pools. That's a fixed and relatively small number. For example, even if all arena bytes could be used for pools (they can't be - there's bookkeeping info at the start of an arena), the maximum number of free pools in an arena would be 2**18 / 2**12 == 2**6 == 64. When a pool is freed, no big deal: unlink the arena from its current list, and stick it in the list for the next higher number of free pools. This, of course, is constant-time. Of course there are edge cases (e.g., if the arena is entirely free, it should be given back to C instead), but they seem pretty obvious, and I haven't thought of a real problem. Tedious, though ;-)
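Here is a minimal, self-contained C sketch of that bucketing idea, just to show why the per-free-pool update is O(1); the struct fields and function names are illustrative, not the real arena_object, and it ignores allocation ordering, arena freeing, and everything else:

```c
#include <stdio.h>

#define MAX_FREE_POOLS 64     /* 256 KiB arena / 4 KiB pool */

typedef struct arena {
    struct arena *prev, *next;
    int nfreepools;
} arena_t;

static arena_t *buckets[MAX_FREE_POOLS + 1];   /* index = # of free pools */

static void unlink_arena(arena_t *a)
{
    if (a->prev) a->prev->next = a->next;
    else buckets[a->nfreepools] = a->next;
    if (a->next) a->next->prev = a->prev;
    a->prev = a->next = NULL;
}

static void link_arena(arena_t *a)             /* push at head of its bucket */
{
    a->prev = NULL;
    a->next = buckets[a->nfreepools];
    if (a->next) a->next->prev = a;
    buckets[a->nfreepools] = a;
}

/* called when a pool inside `a` becomes empty: constant time */
static void pool_freed(arena_t *a)
{
    unlink_arena(a);
    a->nfreepools++;
    link_arena(a);     /* a real version would also free a fully empty arena */
}

int main(void)
{
    arena_t a = {NULL, NULL, 30};
    link_arena(&a);
    pool_freed(&a);
    printf("arena now in bucket %d: %d\n", a.nfreepools, buckets[31] == &a);
    return 0;
}
```

The constant-time move replaces the "crawl over 5,000 former peers" step; everything else about the heuristic (which bucket to allocate from next, when to return an empty arena to C) is a separate policy question.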

On 5/22/19 12:15 PM, Tim Peters wrote:
I have a computer with two Xeon CPUs and 256GB of RAM. So, even though it's NUMA, I still have 128GB of memory per CPU. It's running a "spin" of Ubuntu 18.10. I compiled a fresh Python 3.7.3 --with-optimizations. I copied the sample program straight off the StackOverflow page. The program ran for about five and a half hours then exited normally. During the run it printed: This gets printed! This doesn't get printed Statistics reported by "time": 19811.05s user 123.56s system 99% cpu 5:32:15.04 total Checking in on it now and then, peak observed memory usage (as reported by "top") was just under 80GB. I take it that the interesting part was confirming that "This doesn't get printed" gets printed when you have enough RAM for the program to run to completion. So I guess there's no bug here? Just an observation about CPython's garbage collector being kinda slow? Or maybe CPython gc + swap = super bombad slow? //arry/

[Larry Hastings <larry@hastings.org>]
Thanks, Larry!
All of that is consistent with what others reported, although you've given more quantified details, and that's helpful.
I take it that the interesting part was confirming that "This doesn't get printed" gets printed when you have enough RAM for the program to run to completion.
That was the primary point at first, yes.
No need to confirm this: if you ran it again with a bit more output, you'd find that the vast bulk of the time was spent between the two output lines. That is, it takes hours from the time the train() function starts to return until it finishes returning. That's all consumed as a side effect of `tree` losing its last reference. We know why now, and it's hard to say whether it's "a bug". It's functioning as designed, but the design sucks ;-) So I'd call it a design flaw. The heuristics that were introduced to try to make it likely we could return more empty arenas to C were designed when machines had far less RAM, and can burn time quadratic in the number of arenas. That really didn't matter in a world with a few hundred arenas, but can be a timing disaster in a world with hundreds of thousands (or more) of arenas. The Stackoverflow test case was whittled way down from the OP's real program, which ran "for days" without completing. I've sketched two (albeit closely related, both to each other and to what we currently do) ways the worst-case overall time could be cut from quadratic to linear in the number of arenas, which is asymptotically optimal. More ambitious would be to dream up an entirely different heuristic. Keeping the list of usable arenas fully sorted in order of number of free pools seems a very strong requirement for a poke-and-hope heuristic. But, best I can tell, the person who came up with this heuristic to begin with didn't check in any motivating test programs. So we'd be starting over from scratch. For this specific Stackoverflow case, I've confirmed that it doesn't make a real difference if we didn't bother to sort the list of arenas _at all_ (then it still returns about the same number of arenas to C, both at the point the tree finishes building, and at the time it completes tearing down the tree). So I have a program showing the current scheme can be a disaster, but none where it shows it actually helps anything ;-)

I made a pull request for this that appears to work fine for my 10x smaller test case (reduces tear-down time from over 45 seconds to over 7). It implements my second earlier sketch (add a vector of search fingers, to eliminate searches): https://github.com/python/cpython/pull/13612 It would be good to know how it works on the OP's 10x larger test case, which requires a machine with over 80 GB of RAM. Hoping it will reduce tear-down time from hours to minutes (they have about a billion `Node` objects, so tear-down time will never be trivial). It would also be good to get a code review, if you have the time and inclination. Thanks!

The PR for this looks good to go: https://github.com/python/cpython/pull/13612 But, I still have no idea how it works for the OP's original test case. So, if you have at least 80 GB of RAM to try it, I added `arena.py` to the BPO report: https://bugs.python.org/issue37029 That adds code to the OP's test case to display the times needed to build the tree and to tear it down (& to display some obmalloc stats). So there's no need for you to think about anything ;-) I'm keen to get feedback on this before merging the PR, because this case is so very much larger than anything I've ever tried that I'm wary that there may be more than one "surprise" lurking here. The PR certainly addresses "an obvious" (with hindsight) problem - but is that the _only_ gross design flaw here?

[Tim]
[Inada Naoki <songofacandy@gmail.com>]
I started r5a.4xlarge EC2 instance and started arena.py. I will post the result in next 12 hours.
Thank you! Wrapping this up, Inada attached the results to the bug report. For the OP's original big case, the time to reclaim the tree storage dropped from nearly 5 hours to about 70 seconds. The time to build the tree didn't materially change (it was a bit faster after the patch, but offhand didn't look significantly faster to me). Since I called this a "design flaw" rather than "a bug", I'm not inclined to backport it. If someone else thinks it should be, that's up to them. I'm _assuming_ Github won't decide to backport it on its own - but maybe I'm wrong (Github workflow is still pretty magical to me).
participants (11):
- Antoine Pitrou
- Brett Cannon
- Gregory P. Smith
- Inada Naoki
- Joni Orponen
- Larry Hastings
- Neil Schemenauer
- Steve Dower
- Thomas Wouters
- Tim Peters
- Victor Stinner