[Python-Dev] Proposal to eliminate PySet_Fini

Tim Peters tim.peters at gmail.com
Mon Jul 3 10:14:42 CEST 2006


[Jack Diederich]
>> PyObject_MALLOC does a good job of reusing small allocations but it
>> can't quite manage the same speed as a free list, especially for things that
>> have some extra setup involved (tuples have a free list for each length).

[Martin v. Löwis]
> I would question that statement, for any practical purposes. The cost of
> tuple creation comes from setting the elements to NULL, and that has to be
> done regardless of whether they were allocated new or came from the free list.

Except _that_ overhead is trivial for small tuples, and small tuples
are the only kind the free lists cache.  There are many other
overheads.  If a tuple is taken off a free list, we get to skip
integer multiplication and division checking for overflow before
calling PyObject_GC_NewVar.  We also get to skip the call to
PyObject_GC_NewVar.  That in turn skips another integer
multiplication in the _PyObject_VAR_SIZE macro, and a call to
_PyObject_GC_Malloc.  That in turn skips a call to PyObject_MALLOC,
and conditionals checking whether it's time to trigger a gc
collection.  All of that is highly significant compared to the cost of
setting at most a handful of slots to NULL inline.
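The shape of what the free list buys can be sketched in pure Python (the
names TupleLike, MAX_LEN, and MAX_SAVED are invented for illustration, not
CPython's; the real code is C in tupleobject.c):

```python
# Illustrative free-list sketch, loosely modeled on CPython's
# per-length tuple caching.

MAX_LEN = 20        # only small sizes are cached
MAX_SAVED = 2000    # cap per length, like tupleobject.c's MAXSAVEDTUPLES

_free_lists = [[] for _ in range(MAX_LEN)]

class TupleLike:
    __slots__ = ("slots",)

def new_tuplelike(n):
    """Reuse a cached shell when possible, skipping the whole
    size-computation/allocation path described above."""
    if n < MAX_LEN and _free_lists[n]:
        return _free_lists[n].pop()     # cheap: one pop, slots already NULL-ed
    obj = TupleLike()
    obj.slots = [None] * n              # the full allocation path
    return obj

def free_tuplelike(obj):
    n = len(obj.slots)
    if n < MAX_LEN and len(_free_lists[n]) < MAX_SAVED:
        for i in range(n):              # clear the slots, keep the shell
            obj.slots[i] = None
        _free_lists[n].append(obj)
```

On the reuse path only the slot-clearing and the list pop remain; everything
else the text enumerates above is skipped.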

> Likewise, the GC management has to be done regardless.

_PyObject_GC_TRACK expands to 5 inlined simple stores, and a
predictable branch, so it is often more expensive than setting the
tuple slots to NULL.  But, as above, we get to skip three layers of
function call and "will it overflow?" arithmetic in the service of
_setting up_ an object for gc initially.  Only the gc track/untrack
pair remains for tuples in a free list.
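Those "5 inlined simple stores" are essentially a doubly-linked-list splice.
A rough Python model of the track/untrack pair (field and class names are
illustrative, not CPython's actual struct layout):

```python
# Rough model of _PyObject_GC_TRACK / _PyObject_GC_UNTRACK: splice an
# object's GC header into, or out of, a doubly linked list of tracked
# objects headed by a sentinel node.

class GCHead:
    __slots__ = ("gc_next", "gc_prev", "payload")
    def __init__(self, payload=None):
        self.gc_next = self.gc_prev = None
        self.payload = payload

class GCList:
    def __init__(self):
        self.head = GCHead()                     # sentinel, self-linked
        self.head.gc_next = self.head.gc_prev = self.head

    def track(self, node):
        """The handful of simple stores the post describes."""
        last = self.head.gc_prev
        node.gc_prev = last          # store
        node.gc_next = self.head     # store
        last.gc_next = node          # store
        self.head.gc_prev = node     # store

    def untrack(self, node):
        node.gc_prev.gc_next = node.gc_next
        node.gc_next.gc_prev = node.gc_prev
        node.gc_next = node.gc_prev = None
```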

> So I expect that the speedup is rather minor, and not worth it.

Depends on the app :-)  Here's a test case that gets supernatural
benefit from small-tuple caching:

"""
def doit():
    N1000 = [None] * 1000
    basetup = (5,)
    for i in N1000:
        tups = []
        push = tups.append
        for j in xrange(10):
            for k in N1000:
                push(basetup * j)

from time import clock as now
times = []
for i in range(3):
    start = now()
    doit()
    finish = now()
    times.append(finish - start)
print sorted(times)
"""

With current trunk that printed

[2.9363677646013846, 2.9489729031005703, 2.9689538729183949]

After changing

#define MAXSAVEDTUPLES  2000

to

#define MAXSAVEDTUPLES  0

the times zoomed to

[4.5894824930441587, 4.6023111649343242, 4.629560027293957]

That's pretty dramatic.  OTOH, I don't have any apps that do that <0.5
wink>, and there's another downside:  on SF recently, someone
complained that the 2.5 obmalloc work to release unused arenas wasn't
doing much in his (perhaps equally artificial -- don't know) test
case.  It surprised me too, so I dug into it.  The problem turned out
to be that piles of arenas were being kept "artificially" alive
because obmalloc was the original source of thousands of tuple objects
being kept alive (from obmalloc's POV) in tupleobject.c's free lists.
If you're unlucky, it only takes one tiny tuple in one free list to
keep an entire 256KB arena alive -- and if you're very unlucky, you
manage to allocate objects in such a way that this happens repeatedly.
 His test also created lots of dicts along the way, and each arena got
mostly filled up with dict objects and a relative handful of small
tuples.  By the time all of this became trash, the tuples were spread
over a few hundred arenas, which effectively became immortal.
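The pinning effect is easy to model.  In this toy simulation (the Arena
class and its capacity are invented for illustration), each arena can only
be returned to the system when everything allocated from it has been freed,
so one cached tuple per arena keeps every arena alive:

```python
# Toy model of arena pinning: mostly-dict arenas each end up holding one
# small tuple that a free list keeps alive, so no arena can be released
# until the free list itself is cleared.

ARENA_CAPACITY = 64

class Arena:
    def __init__(self):
        self.live = 0   # allocations not yet freed

def alloc(arenas):
    """Allocate from the newest arena, opening a new one when full."""
    if not arenas or arenas[-1].live >= ARENA_CAPACITY:
        arenas.append(Arena())
    arenas[-1].live += 1
    return arenas[-1]   # remember which arena the object lives in

def free(arena):
    arena.live -= 1

# Fill 100 arenas, each with 63 "dicts" and one small "tuple".
arenas, dict_homes, tuple_free_list = [], [], []
for _ in range(100):
    for _ in range(ARENA_CAPACITY - 1):
        dict_homes.append(alloc(arenas))
    tuple_free_list.append(alloc(arenas))

for a in dict_homes:        # all the dicts become trash...
    free(a)

# ...but each arena still holds one cached tuple, so all stay pinned.
pinned = sum(1 for a in arenas if a.live > 0)

for a in tuple_free_list:   # flushing the free list releases every arena
    free(a)
reclaimable = sum(1 for a in arenas if a.live == 0)
```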

While it doesn't make any real sense, I've seen repeatedly that users
_try_ calling gc.collect() in such cases (doesn't make sense because
it has nothing to do with cyclic gc).  But that suggests it could be a
_pragmatic_ win to add an internal "clear all known free lists"
function, which gc could call at times, or even just be forced by the
user via a `gc` entry point or gc.collect() option.

The immortal & unbounded int and float free lists don't cause obmalloc
arenas to stay alive unduly, because they get their memory in chunks
straight from the system malloc().  But they have to be the most
frequent source of complaints from newbies and Zope users ;-), who do
things like

    range(10000000)

and then marvel that some 120MB of VM is still being used after nuking
the list and doing a gc.collect().  I get weary of explaining that one
:-(.
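Where that memory goes is easy to account for: the list's pointer array plus
one int object per element.  A rough back-of-the-envelope check (the byte
counts below come from a 64-bit CPython 3, so they're larger than the 2006
figures, but the shape of the problem is identical):

```python
# Rough accounting of the memory a big list of ints occupies while it's
# alive: one pointer slot per element in the list, plus a full int
# object per element.  Sizes assume a 64-bit CPython 3.

import sys

n = 1_000_000
nums = list(range(n))

pointer_array = sys.getsizeof(nums)    # 8 bytes per slot + list header
per_int = sys.getsizeof(nums[-1])      # ~28 bytes per small int object
total_mb = (pointer_array + per_int * n) / 2**20
```

Scale n up tenfold and the int objects alone reach the ~120MB range, and on
the 2006 interpreter that memory stayed on the immortal int free list after
the list was nuked.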


More information about the Python-Dev mailing list