[Python-Dev] RE: Program very slow to finish
Tim Peters
tim.one@home.com
Tue, 6 Nov 2001 19:12:12 -0500
[Martin von Loewis]
> I was looking into making pymalloc the default a few weeks ago, when I
> noticed that allocation is pretty efficient on Linux. I studied glibc
> malloc a bit to see that it is indeed quite clever about allocation,
> even in the presence of MT locks (with per-arena locks, and recording
> a per-thread arena in a thread-local variable).
Is this Doug Lea's malloc implementation
<ftp://g.oswego.edu/pub/misc/malloc.c>? Can you tell whether it was the one
Roeland used in his "before" test case run?
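For the archives, here's the flavor of the trick Martin describes, as a
minimal sketch (hypothetical names throughout -- NARENAS, arena_alloc(),
etc. -- not glibc's actual code): each arena carries its own lock, and a
thread remembers the last arena it used in thread-local storage, so
threads usually end up contending on different locks:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct arena {
        pthread_mutex_t lock;   /* per-arena lock */
        /* ... free lists, top chunk, etc. ... */
    } arena;

    #define NARENAS 8
    static arena arenas[NARENAS];
    static pthread_key_t arena_key;  /* pthread_key_create() at startup */

    extern void *arena_alloc(arena *, size_t);  /* real carving, elided */

    void *
    mt_malloc(size_t n)
    {
        arena *a = (arena *)pthread_getspecific(arena_key);
        void *p;

        if (a == NULL) {
            /* First malloc in this thread: take the first arena whose
               lock isn't held right now, and remember it. */
            int i;
            for (i = 0; i < NARENAS; i++) {
                if (pthread_mutex_trylock(&arenas[i].lock) == 0) {
                    pthread_mutex_unlock(&arenas[i].lock);
                    break;
                }
            }
            a = &arenas[i < NARENAS ? i : 0];
            pthread_setspecific(arena_key, a);
        }
        pthread_mutex_lock(&a->lock);
        p = arena_alloc(a, n);
        pthread_mutex_unlock(&a->lock);
        return p;
    }

A single-threaded program pays one uncontended lock per call that way,
and a multi-threaded one mostly stays out of its own way.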
> I believe the main difference in deallocation speed (which I didn't
> notice then) comes from the attempt to coalesce adjacent memory
> blocks in glibc; pymalloc doesn't attempt to do so.
That would be my guess; I traced bad deallocation performance to free()
coalescing in a different system long ago, and it seems a common problem.
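For the curious, the usual boundary-tag arrangement makes the cost easy
to see.  A sketch (hypothetical layout and helpers, not glibc's actual
code):

    #include <stddef.h>

    /* Each chunk records its own size; the low bit of `size' says
       whether the chunk *before* it in memory is in use, and a free
       chunk repeats its size in its successor's prev_size, so
       neighbors can be found in both directions. */
    typedef struct chunk {
        size_t prev_size;       /* size of previous chunk, iff free */
        size_t size;            /* this chunk's size | PREV_INUSE bit */
        struct chunk *fd, *bk;  /* free-list links, valid when free */
    } chunk;

    #define PREV_INUSE   1
    #define CHUNKSIZE(c) ((c)->size & ~(size_t)PREV_INUSE)
    #define NEXT(c)      ((chunk *)((char *)(c) + CHUNKSIZE(c)))
    #define PREV(c)      ((chunk *)((char *)(c) - (c)->prev_size))

    extern void unlink_chunk(chunk *); /* splice out of its free list */
    extern void link_chunk(chunk *);   /* push onto the right free list */
    extern int  next_is_free(chunk *); /* peek at the successor's bits */

    void
    coalescing_free(chunk *c)
    {
        /* Merge with a free predecessor: extra loads, branches, and
           free-list surgery on *every* free(). */
        if (!(c->size & PREV_INUSE)) {
            chunk *p = PREV(c);
            unlink_chunk(p);
            p->size += CHUNKSIZE(c);
            c = p;
        }
        /* And likewise with a free successor. */
        if (next_is_free(c)) {
            chunk *n = NEXT(c);
            unlink_chunk(n);
            c->size += CHUNKSIZE(n);
        }
        link_chunk(c);
    }

pymalloc's free(), by contrast, essentially just pushes the block onto
its pool's singly-linked free list.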
> Also, there *is* a lock acquire/release cycle in glibc malloc, which
> may account for some overhead, too.
Sure. Doug Lea sez (see above) "If you are using malloc in a concurrent
program, you would be far better off obtaining ptmalloc ..."; slapping a
lock around a thread-naive malloc is merely correct.
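For concreteness, "merely correct" means no more than this (a sketch,
with plain malloc/free standing in for the thread-naive allocator):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Safe, but every thread serializes through one mutex on every
       allocation and deallocation. */
    void *
    locked_malloc(size_t n)
    {
        void *p;
        pthread_mutex_lock(&malloc_lock);
        p = malloc(n);
        pthread_mutex_unlock(&malloc_lock);
        return p;
    }

    void
    locked_free(void *p)
    {
        pthread_mutex_lock(&malloc_lock);
        free(p);
        pthread_mutex_unlock(&malloc_lock);
    }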
> Looking at obmalloc.c, I see one aspect that I'd consider
> questionable: detection of pool-vs-system-malloc uses a heuristic,
> namely
>
>     offset = (off_t )p & POOL_SIZE_MASK;
>     pool = (poolp )((block *)p - offset);
>     if (pool->pooladdr != pool || pool->magic != (uint )POOL_MAGIC) {
>         _SYSTEM_FREE(p);
>         return;
>     }
>
> So if the pool header has the pool magic and points to itself, it
> really is a pool header. That may result in a system-malloc'ed block
> being recognized as a pool block, under rare circumstances. As a result,
> the system heap will be corrupted.
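To pin down the arithmetic (made-up address; POOL_SIZE is 4KB, so
POOL_SIZE_MASK is 0xfff):

    /* Suppose p == 0x12345678.  Then
           offset = 0x12345678 & 0xfff  ==  0x678
           pool   = 0x12345678 - 0x678  ==  0x12345000
       i.e. p rounded down to its 4KB boundary, which is where a
       genuine pool keeps its header.  For a block that really came
       from the system malloc, the bytes that happen to sit at that
       boundary would have to spell out both a correct self-pointer
       and the magic number before free() misfiled it. */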
Yes. I believe Vladimir offered to buy us lunch if that ever happened in a
real program, so he's betting at least US$4 that it won't -- and nothing
else in Python is backed by that much hard cash <wink>. The tradeoff is
that he gets a lot of comparative storage efficiency out of this trick, and
some speed too (no per-object doubly-linked free lists to maintain, no need
to store size fields). I don't see a way to make it bulletproof without
giving that up, short of over-allocating "large requests" enough so that a
hidden pool_header struct can be stuffed at a POOL_SIZE-aligned address.
Then the above could become
    offset = (off_t )p & POOL_SIZE_MASK;
    pool = (poolp )((block *)p - offset);
    assert(pool->pooladdr == pool);
    /* pool->magic member no longer exists; use, e.g.,
       pool->szidx == (uint)-1 as a flag to mean "this block came
       from the system malloc".
    */
    if (pool->szidx == (uint)-1) {
        _SYSTEM_FREE(p);
        return;
    }
POOL_SIZE is 4KB, though, and that's a lot of wasted space. OTOH, a "large
object arena" could be introduced too, and carved up much like the current
one.
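To spell out the over-allocation idea (a sketch only, assuming a pointer
fits in a size_t; pool_header here is a stand-in with just the fields
the sketch needs, plus a raw pointer so the eventual free can find what
malloc() really returned):

    #include <stdlib.h>

    typedef unsigned int uint;
    #define POOL_SIZE       4096
    #define POOL_SIZE_MASK  (POOL_SIZE - 1)

    typedef struct pool_header {
        struct pool_header *pooladdr;  /* self-pointer, as in obmalloc */
        uint szidx;                    /* (uint)-1 flags "system block" */
        void *raw;                     /* what malloc() actually returned */
    } pool_header;

    void *
    system_alloc_tagged(size_t n)
    {
        /* POOL_SIZE extra guarantees a 4KB boundary inside the block;
           sizeof(pool_header) extra pays for the hidden header. */
        char *raw = malloc(n + POOL_SIZE + sizeof(pool_header));
        char *boundary;
        pool_header *fake;

        if (raw == NULL)
            return NULL;
        /* First POOL_SIZE-aligned address at or above raw; e.g. for
           raw == 0x12345678 that's 0x12346000. */
        boundary = (char *)(((size_t)raw + POOL_SIZE_MASK)
                            & ~(size_t)POOL_SIZE_MASK);
        fake = (pool_header *)boundary;
        fake->pooladdr = fake;
        fake->szidx = (uint)-1;
        fake->raw = raw;
        return boundary + sizeof(pool_header);
    }

    void
    system_free_tagged(void *p)
    {
        /* The same rounding free() already does finds the header. */
        pool_header *fake = (pool_header *)
            ((char *)p - ((size_t)p & POOL_SIZE_MASK));
        free(fake->raw);
    }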
I suppose the bottom line is that you just can't make address tricks
bulletproof unless you arrange-- by hook or by crook --to control the
addresses you pass out (so you never call malloc at all, or wrap what it
does).
> Another aspect that may be troubling is that obmalloc never returns
> memory to the system.
Ah, but it does for "large objects" obtained via malloc() -- they get
free()d individually.
> It just detects free pools, and is willing to rearrange a free pool for
> use with a different object size. Of course, there is little chance
> that a complete arena will ever become empty, so this is probably not
> that bad.
It's also nothing really new, and *could* lead to an improvement: the hope
was that PyMalloc could be made fast enough that Python could drop all its
fiddly little type-specific free lists. Those never return memory either,
and, worse, can't even recycle memory across types; for example, if some
phase of your program uses a million floats at one time, and the next phase
only retains their sum, you're stuck forever with space for a million (minus
1 <wink>) dead float objects, sitting in floatobject.c's block_list.
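For anyone who hasn't stared at floatobject.c lately, the pattern looks
like this (a stripped-down sketch; the real code differs in names and
sizes, and reuses the ob_type slot for the free-list link):

    #include <stdlib.h>

    typedef struct flt {
        struct flt *next_free;  /* free-list link while dead */
        double value;           /* payload while alive */
    } flt;

    #define N_PER_BLOCK 100

    typedef struct block {
        struct block *next;
        flt objects[N_PER_BLOCK];
    } block;

    static block *block_list = NULL;  /* every block ever malloc'ed */
    static flt *free_list = NULL;     /* dead objects, ripe for reuse */

    static flt *
    alloc_flt(void)
    {
        flt *f;
        if (free_list == NULL) {
            /* Grab a whole new block and thread its objects together. */
            block *b = malloc(sizeof(block));
            int i;
            if (b == NULL)
                return NULL;
            b->next = block_list;
            block_list = b;
            for (i = 0; i < N_PER_BLOCK - 1; i++)
                b->objects[i].next_free = &b->objects[i + 1];
            b->objects[N_PER_BLOCK - 1].next_free = NULL;
            free_list = &b->objects[0];
        }
        f = free_list;
        free_list = f->next_free;
        return f;
    }

    static void
    dealloc_flt(flt *f)
    {
        /* "Freeing" just pushes onto free_list: the memory is never
           returned to the system, and can never be reused by any
           other type. */
        f->next_free = free_list;
        free_list = f;
    }

Note that dealloc_flt() can't even tell when a whole block has died, so
nothing is ever handed back.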