[Python-Dev] RE: Program very slow to finish

Tue, 6 Nov 2001 19:12:12 -0500

[Martin von Loewis]
> I was looking into making pymalloc the default a few weeks ago, when I
> noticed that allocations is pretty efficient on Linux. I studied glibc
> malloc a bit to see that it is indeed quite clever about allocation,
> even in the presence of MT locks (with per-arena locks, and recording
> a per-thread arena in a thread-local variable).

Is this Doug Lea's malloc implementation
<ftp://g.oswego.edu/pub/misc/malloc.c>?  Can you tell whether it was the one
Roeland used in his "before" test case run?

> I believe the main difference in deallocation speed (which I didn't
> notice then) comes from the attempt to combine subsequent memory
> blocks in glibc; pymalloc doesn't attempt to do so.

That would be my guess; I traced bad deallocation performance to free()
coalescing in a different system long ago, and it seems a common problem.

> Also, there *is* a lock acquire/release cycle in glibc malloc, which
> may account for some overhead, too.

Sure.  Doug Lea sez (see above) "If you are using malloc in a concurrent
program, you would be far better off obtaining ptmalloc ..."; slapping a
lock around a thread-naive malloc is merely correct.

> Looking at obmalloc.c, I see one aspect that I'd consider
> questionable: detection of pool-vs-system-malloc uses a heuristics,
> namely
>
> 	offset = (off_t )p & POOL_SIZE_MASK;
> 	pool = (poolp )((block *)p - offset);
> 	if (pool->pooladdr != pool || pool->magic != (uint )POOL_MAGIC) {
> 		_SYSTEM_FREE(p);
> 		return;
> 	}
>
> So if the pool header has the pool magic and points to itself, it
> really is a pool header. That may result in a system malloc being
> recognized as a pool malloc, under rare circumstances. As a result,
> the system heap will be corrupted.

Yes.  I believe Vladimir offered to buy us lunch if that ever happened in a
real program, so he's betting at least US$4 that it won't -- and nothing
else in Python is backed by that much hard cash <wink>.  The tradeoff is
that he gets a lot of comparative storage efficiency out of this trick, and
some speed too (no per-object doubly-linked free lists to maintain, no need
to store size fields).  I don't see a way to make it bulletproof without
giving that up, short of over-allocating "large requests" enough so that a
hidden pool_header struct can be stuffed at a POOL_SIZE-aligned address.
Then the above could become

	offset = (off_t )p & POOL_SIZE_MASK;
	pool = (poolp )((block *)p - offset);
	assert(pool->pooladdr == pool);
	/*  pool->magic member no longer exists; use, e.g.,
	    pool->szidx == (uint)-1 as a flag to mean "this block came
	    from the system malloc".
	*/
	if (pool->szidx == (uint)-1)) {
		_SYSTEM_FREE(p);
		return;
	}

POOL_SIZE is 4KB, though, and that's a lot of wasted space.  OTOH, a "large
object arena" could be introduced too, and carved up much like the current
one.

I suppose the bottom line is that you just can't make address tricks
bulletproof unless you arrange-- by hook or by crook --to control the
addresses you pass out (so you never call malloc at all, or wrap what it
does).

> Another aspect that may be troubling is that obmalloc never returns
> memory to the system.

Ah, but it does for "large objects" obtained via malloc() -- they get
free()d individually.

> It just detects free pools, and is willing to rearrange a free pool for
> use with a different object size. Of course, there is little chance
> that a complete arena will ever become empty, so this is probably not
> that bad.

It's also nothing really new, and *could* lead to an improvement:  the hope
was that PyMalloc could be made fast enough that Python could drop all its
fiddly little type-specific free lists.  Those never return memory either,
and, worse, can't even recycle memory across types; for example, if some
phase of your program uses a million floats at one time, and the next phase
only retains their sum, you're stuck forever with space for a million (minus
1 <wink>) dead float objects, sitting in floatobject.c's block_list.