[Python-Dev] pymalloc killer

Fri, 29 Mar 2002 05:32:32 -0500

[Tim]
> Recall that pymalloc delegates requests for "not small" blocks to
> the system malloc.  This means that when pymalloc's free() is called,
> it has to figure out whether the address passed to it was one it
> allocated itself, or came from the system malloc.
> ...
> However, the question remained whether a hostile user could study
> the source code to mount a successful attack.  Alas, it turns out
> that's easy.

There may be hope for this, and another strong reason for pursuing it.
Guido pointed out that *if* pymalloc's free() could know for sure whether an
address passed to it did or didn't come from the system malloc, then we
could also (probably -- needs some more thought) continue to use the
PyObject interface, *without* breaking most "illegal by fiat" old code, via
also mapping PyMem_{Free, FREE, Del, DEL} to the pymalloc free (when
pymalloc is enabled):  then the various PyMem_NukeIt spellings would *know*
whether to pass a piece of memory on to the system free().

So I have a variant of pymalloc that doesn't use magic cookies -- it knows
"for sure", by maintaining a sorted (by address) contiguous vector of the
base addresses of the (256KB each) "arenas" allocated by pymalloc.  A given
address A either does or doesn't satisfy

    B <= A < B+256KB

for some base address B in this list.  At worst, a binary search is needed
to see whether any such B exists.  For a typical 2**31 byte VM address
space, we can get at most 2**31/2**18 == 2**13 arenas, so at worst this
vector can grow to a measly 8K entries (typically far fewer), and 13 is a
reasonable upper bound on the number of compares needed.

So it can't be a disaster, but neither is it dirt cheap.  A promising
candidate for saving grace:  in tests so far, frees tend to hit the same
arena many times in a row.  By saving a search finger into the vector that
remembers the last hit index, and checking that location first, I'm seeing
hit rates over 85%.  The two-sided range test at the finger index can be
inlined, and the two compares it requires are exactly as many compares as
the current magic-cookie tests do.  So, most of the time, it may run as fast
as now.

For an address pymalloc *doesn't* handle on its own, though, a full binary
search is required to discover that it's not a pymalloc address.  Leaving at
least one magic-cookie gimmick in could slash that (if the magic cookie
isn't there, we know for sure it's not a pymalloc address).  So the test
becomes something like

def PyMalloc_Free(thisaddr):
    # ... weed out NULL ...

    # not NULL
    if (addrs[finger] <= thisaddr < addrs[finger] + 256KB
        or (thisaddr_has_magic_cookie and binary_search_for(thisaddr)):
        it's a pymalloc addr
    else
        it goes to the system free()

That gives a likely fast-path for each class of address, and should never
err (provided we're passed legitimate addresses, and user code doesn't write
out of bounds, etc).

A nasty subtlety:  if PyMem_NukeIt is called when the GIL isn't held, it's
going to take a lot of thought to see whether this can be made safe when
called simultaneously by multiple threads; e.g., the finger can change
"mid-stream" then, and, worse, some thread that *does* hold the GIL may
legitimately cause a new arena to be allocated midstream, mutating the
address vector during the search.  This may be intractable enough to kill
the idea of mapping PyMem_XXX to this.

A nasty trick:  due to the wonders of unsigned arithmetic, the two-sided
range test can be reduced to one compare:

    (Py_uintptr)thisaddr - (Py_uintptr)addrs[finger] < 256KB

In other news, "it's obvious" that the small object threshhold is too small
for 2.3.  I'm still inclined to max it out.