[Python-Dev] collections module

Tim Peters tim.one at comcast.net
Sat Jan 10 23:22:21 EST 2004


[Tim]
>> ...
>> We use mildly exponential overallocation when growing a list to
>> ensure amortized O(1) behavior regardless.

[Kurt B. Kaiser]
> The whole game lies in what happens when you resize: how much data are
> you going to copy?

We currently don't copy anything when we resize, although the platform
realloc() may.  To "make room at the left end" we would have to do a
memmove() on a contiguous vector of len(list) pointers.  I don't agree
that's "the whole game", though -- it's a small piece of the game (the point
of using exponential overallocation is to ensure that).  As I said in the
first msg, though, the overallocation gimmick we use now is less effective
if a list can grow and shrink at both ends.
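
Concretely, the overallocation idea looks something like this (a bare
sketch with invented names, not the actual listobject.c code):

typedef struct {
	void **items;	/* contiguous vector of pointers */
	int len;	/* slots in use */
	int allocated;	/* slots allocated */
} vec;	/* invented names for the sketch */

static int
vec_append(vec *v, void *item)
{
	if (v->len == v->allocated) {
		/* Grow by ~1/8 of the current size plus a constant;
		 * any fixed fraction gives amortized O(1) appends,
		 * because N appends copy O(N) pointers in total.
		 */
		int newalloc = v->allocated + (v->allocated >> 3) + 8;
		void **p = realloc(v->items, newalloc * sizeof(void *));
		if (p == NULL)
			return -1;
		v->items = p;
		v->allocated = newalloc;
	}
	v->items[v->len++] = item;
	return 0;
}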

> ...
> If I understand the gimmick, as the queue is used

If a list is used as a queue, yes.

> the ob_item/headpointer will walk down the block and items will be
> appended at the tail until the tail collides with the end of the
> block.  At that point the gimmick approach will memmove the queue to
> place ob_item at the beginning of the block.

Possibly -- nobody has given enough detail to say.  I'd be inclined to
continue doing what we do now when an append runs out of room, which is to
ask for more room, overallocating by a small percentage of the current list
size.  If there's free room "at the left end", though, it may or may not be
more efficient to memmove the guts left.  Those are important details that
haven't gotten beyond the hand-wavy stage.
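
One plausible way those details might go (a sketch only; glist and its
fields are invented, and PyObject comes from Python.h):  slide left only
when pop(0) has opened up a worthwhile amount of slack, else grow as usual.

typedef struct {
	PyObject **block;	/* start of the allocation */
	PyObject **ob_item;	/* first live element; drifts right on pop(0) */
	int len;		/* live elements */
	int allocated;		/* total slots in the block */
} glist;	/* invented names for the sketch */

/* Called when ob_item + len has collided with the end of the block. */
static int
glist_make_room(glist *l)
{
	int slack = l->ob_item - l->block;	/* free slots at the left */
	PyObject **p;
	int newalloc;

	if (slack > 0 && slack >= (l->len >> 3)) {
		/* enough reclaimed room: memmove the guts left */
		memmove(l->block, l->ob_item, l->len * sizeof(PyObject *));
		l->ob_item = l->block;
		return 0;
	}
	/* otherwise, the usual proportional overallocation */
	newalloc = l->allocated + (l->allocated >> 3) + 8;
	p = realloc(l->block, newalloc * sizeof(PyObject *));
	if (p == NULL)
		return -1;
	l->ob_item = p + (l->ob_item - l->block);
	l->block = p;
	l->allocated = newalloc;
	return 0;
}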

> Since the block is only a little longer than the queue, that's
> quite a bit of copying, unless I'm missing something.

The amount of copying in *any* case where the speculative gimmick copies is
proportional to the number of elements currently in the list.  When a list
contains a million items, it's expensive.  If a list contains a few dozen
items, it's cheap.  The current scheme overallocates proportional to the
number of elements in the list, so "a little longer" is "a little" in a
relative sense rather than in an absolute sense (e.g., it's millions of
elements longer if the list contains hundreds of millions of elements --
then you can go through millions of appends after a resize before needing to
do anything fancy again).

> With a circular buffer approach, the headpointer and tailpointer will
> walk down the block and when the end is reached the tailpointer will
> wrap around and "appended" items will be stored at the low end of the
> block.  The process continues without any copying.

So long as it doesn't run out of room.  Eventually it will, and then it has
to move stuff too.  At that level the schemes are the same.  Given specific
common queue access patterns, a queue in steady state will "run out of room"
more often under the contiguous gimmick than the circular one (I gave a
simple example of that before, which in fact is probably a best case for the
circular implementation and a worst case for the other one).
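
For concreteness, here's the copy-free steady state being described (a bare
sketch with invented circq names; the resize case is sketched later):

typedef struct {
	PyObject **block;	/* the allocation */
	int head;		/* index of the first element */
	int len;		/* number of elements */
	int allocated;		/* block size */
} circq;	/* invented names for the sketch */

static int
circq_append(circq *q, PyObject *v)
{
	int tail;
	if (q->len == q->allocated)
		return -1;	/* full:  a real version must resize here,
				   and then it moves stuff too */
	tail = q->head + q->len;
	if (tail >= q->allocated)
		tail -= q->allocated;	/* wrap */
	q->block[tail] = v;
	q->len++;
	return 0;
}

static PyObject *
circq_popleft(circq *q)
{
	PyObject *v = q->block[q->head];	/* caller ensures len > 0 */
	if (++q->head == q->allocated)
		q->head = 0;	/* wrap */
	q->len--;
	return v;
}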

> The position of the pointers is calculated modulo the blocksize, so no
> test/branch is involved.

Integer mod is much worse than test/branch -- Python runs on boxes where an
integer divide takes more than 100 cycles.  So I was assuming the cheaper
way to implement this:  test-and-conditionally-subtract (or -add) instead
of integer mod, as in

	if (physical_pointer >= fence)
		physical_pointer -= block_size;

And don't suggest restricting allocated sizes to powers of 2 just to make
integer mod efficient -- we can't afford that much memory waste.

> If there is some space at the beginning of the block which has been
> freed up by pop(0), the list will automatically wrap around to fill
> it, even when the pop(0) has been done in a non-queue context.

You can trust that my objection isn't that I don't understand how to
implement a circular queue <wink>.

> With a circular buffer, upon resize the gap between tail and head is
> widened by the amount added to the block.  With a queue, that would on
> average amount to copying half the data, but with a known stride.

I don't know what "known stride" means in this context.  Regardless of
whether a circular implementation is used, a Python list is a contiguous
vector of C pointers, and they're all the same size.  Only these pointers
are moved.  With a reasonably careful circular implementation, needing to
move half the pointers would be the worst case; assuming that the head and
tail are equally likely to bump into each other at each possible position,
the average number of pointer moves would be len/4.
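
A careful resize along those lines might look like this (again a sketch,
reusing the invented circq names above).  Moving whichever segment is
smaller is what makes len/4 the average rather than len/2:

static int
circq_grow(circq *q)
{
	int newalloc = q->allocated + (q->allocated >> 3) + 8;
	PyObject **p = realloc(q->block, newalloc * sizeof(PyObject *));
	if (p == NULL)
		return -1;
	if (q->head + q->len > q->allocated) {	/* contents wrap */
		int nupper = q->allocated - q->head;	/* head .. old end */
		int nlower = q->len - nupper;	/* wrapped part at index 0 */
		if (nlower <= nupper && nlower <= newalloc - q->allocated)
			/* unwrap the low segment into the new space */
			memcpy(p + q->allocated, p,
			       nlower * sizeof(PyObject *));
		else {
			/* slide the upper segment to the end of the block */
			memmove(p + newalloc - nupper, p + q->head,
				nupper * sizeof(PyObject *));
			q->head = newalloc - nupper;
		}
	}
	q->block = p;
	q->allocated = newalloc;
	return 0;
}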

> One could take a conservative approach and rotate the buffer every
> resize so that ob_item is at the beginning of the block, but doing
> that seems to be unnecessary.

Also unhelpful -- every line of C code mucking with lists would still have
to cater to the possibility that the list is spread across two disjoint
memory slices.

...

>> As above, I don't see much to recommend that over the speculative
>> Python list gimmick:  indexing is slower and more complicated,

> pointer arithmetic is cheap compared to memory access.

Let's inject some common sense here.  Indexed list access is an extremely
frequent operation, and cache-friendly sequential indexed list access is
also extremely frequent.  Moving memory around is a rare operation, and
neither scheme can avoid it:  the overwhelmingly most common kind of growing
list will continue to be append-at-the-end, where the circular scheme buys
nothing, but slows down all list operations.  Growing lists are in turn rare
compared to fixed-size lists, where the circular implementation is pure
loss, in both time and space.  The speculative gimmick is a loss in space
there too, but doesn't slow it down.

>> resizing is more complicated,

> I'm not sure that's the case.

Neither am I, really <wink>.

> And the gimmick approach has a 'complication' when the tail hits the
> end of the block even when the queue length is constant.

That appears to be its only drawback.


>> and all the listobject.c internals would get more complicated by
>> needing to deal with that the list contents are no longer
>> necessarily in a contiguous slice of a C vector (from the C POV, the
>> list contents could be in two disjoint contiguous slices).

> Not if modulo pointer arithmetic is used.

Sorry, but I consider this one too unrealistic to continue talking about.
Start here:

#define PyList_GET_ITEM(op, i) (((PyListObject *)(op))->ob_item[i])
#define PyList_SET_ITEM(op, i, v) \
    (((PyListObject *)(op))->ob_item[i] = (v))

What are you going to replace those with?  They must be fast, and they're
expanded in over 100 places in the core.
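For concreteness, a circular version might look something like this
(ob_block, ob_head and ob_alloc are invented for the sketch -- nobody has
proposed actual names):

#define CIRC_INDEX(op, i) \
	(((PyListObject *)(op))->ob_head + (i) < ((PyListObject *)(op))->ob_alloc \
		? ((PyListObject *)(op))->ob_head + (i) \
		: ((PyListObject *)(op))->ob_head + (i) - ((PyListObject *)(op))->ob_alloc)
#define CIRC_GET_ITEM(op, i) \
	(((PyListObject *)(op))->ob_block[CIRC_INDEX(op, i)])
#define CIRC_SET_ITEM(op, i, v) \
	(((PyListObject *)(op))->ob_block[CIRC_INDEX(op, i)] = (v))

Besides paying an extra compare and a data-dependent branch on every access,
note that the index expression is evaluated more than once, which the
current macros never do.

Then move on to here (the simplest function in listobject.c that actually
does something <wink>):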

/* Reverse a slice of a list in place, from lo up to (exclusive) hi. */
static void
reverse_slice(PyObject **lo, PyObject **hi)
{
	assert(lo && hi);

	--hi;
	while (lo < hi) {
		PyObject *t = *lo;
		*lo = *hi;
		*hi = t;
		++lo;
		--hi;
	}
}

How are you going to rewrite that?  For that matter, how are you going to
change the function *signature* when "a slice of a Python list" can no
longer be represented by a low and high pointer?
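
Under the invented circular layout above, it turns into something like this
-- a list plus logical indices instead of two raw pointers, with every
element access detouring through the wrap fixup:

static void
circ_reverse_slice(PyListObject *op, int lo, int hi)
{
	--hi;
	while (lo < hi) {
		PyObject *t = CIRC_GET_ITEM(op, lo);
		CIRC_SET_ITEM(op, lo, CIRC_GET_ITEM(op, hi));
		CIRC_SET_ITEM(op, hi, t);
		++lo;
		--hi;
	}
}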

When you're done with that, do this one:

/* Merge the na elements starting at pa with the nb elements starting at pb
 * in a stable way, in-place.  na and nb must be > 0, and pa + na == pb.
 * Must also have that *pb < *pa, that pa[na-1] belongs at the end of the
 * merge, and should have na <= nb.  See listsort.txt for more info.
 * Return 0 if successful, -1 if error.
 */
static int
merge_lo(MergeState *ms, PyObject **pa, int na, PyObject **pb, int nb)
{
	int k;
	PyObject *compare;
	PyObject **dest;
	int result = -1;	/* guilty until proved innocent */
	int min_gallop = ms->min_gallop;

	assert(ms && pa && pb && na > 0 && nb > 0 && pa + na == pb);
	if (MERGE_GETMEM(ms, na) < 0)
		return -1;
	memcpy(ms->a, pa, na * sizeof(PyObject*));
	dest = pa;
	pa = ms->a;

	*dest++ = *pb++;
	--nb;
	if (nb == 0)
		goto Succeed;
	if (na == 1)
		goto CopyB;

	compare = ms->compare;
	for (;;) {
		int acount = 0;	/* # of times A won in a row */
		int bcount = 0;	/* # of times B won in a row */

		/* Do the straightforward thing until (if ever) one run
		 * appears to win consistently.
		 */
		for (;;) {
			assert(na > 1 && nb > 0);
			k = ISLT(*pb, *pa, compare);
			if (k) {
				if (k < 0)
					goto Fail;
				*dest++ = *pb++;
				++bcount;
				acount = 0;
				--nb;
				if (nb == 0)
					goto Succeed;
				if (bcount >= min_gallop)
					break;
			}
			else {
				*dest++ = *pa++;
				++acount;
				bcount = 0;
				--na;
				if (na == 1)
					goto CopyB;
				if (acount >= min_gallop)
					break;
			}
		}

		/* One run is winning so consistently that galloping may
		 * be a huge win.  So try that, and continue galloping until
		 * (if ever) neither run appears to be winning consistently
		 * anymore.
		 */
		++min_gallop;
		do {
			assert(na > 1 && nb > 0);
			min_gallop -= min_gallop > 1;
			ms->min_gallop = min_gallop;
			k = gallop_right(*pb, pa, na, 0, compare);
			acount = k;
			if (k) {
				if (k < 0)
					goto Fail;
				memcpy(dest, pa, k * sizeof(PyObject *));
				dest += k;
				pa += k;
				na -= k;
				if (na == 1)
					goto CopyB;
				/* na==0 is impossible now if the comparison
				 * function is consistent, but we can't assume
				 * that it is.
				 */
				if (na == 0)
					goto Succeed;
			}
			*dest++ = *pb++;
			--nb;
			if (nb == 0)
				goto Succeed;

			k = gallop_left(*pa, pb, nb, 0, compare);
			bcount = k;
			if (k) {
				if (k < 0)
					goto Fail;
				memmove(dest, pb, k * sizeof(PyObject *));
				dest += k;
				pb += k;
				nb -= k;
				if (nb == 0)
					goto Succeed;
			}
			*dest++ = *pa++;
			--na;
			if (na == 1)
				goto CopyB;
		} while (acount >= MIN_GALLOP || bcount >= MIN_GALLOP);
		++min_gallop;	/* penalize it for leaving galloping mode */
		ms->min_gallop = min_gallop;
	}
Succeed:
	result = 0;
Fail:
	if (na)
		memcpy(dest, pa, na * sizeof(PyObject*));
	return result;
CopyB:
	assert(na == 1 && nb > 0);
	/* The last element of pa belongs at the end of the merge. */
	memmove(dest, pb, nb * sizeof(PyObject *));
	dest[nb] = *pa;
	return 0;
}

There are thousands of additional lines that assume a list is one contiguous
vector of pointers, and that the dead-simple and killer-fast ordinary
contiguous-vector C idioms work.
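
One example of the idiom (a sketch, not the actual list_slice code):
extracting a slice of a contiguous list is a single memcpy, while the same
operation on a wrapped list (reusing the invented circq names above) needs
a two-segment case analysis, and so would every line like it:

static void
copy_slice(PyObject **dest, PyObject **items, int ilow, int ihigh)
{
	/* contiguous case:  one killer-fast block copy */
	memcpy(dest, items + ilow, (ihigh - ilow) * sizeof(PyObject *));
}

static void
circ_copy_slice(PyObject **dest, circq *q, int ilow, int ihigh)
{
	int start = q->head + ilow;
	int n = ihigh - ilow;
	if (start >= q->allocated)
		start -= q->allocated;	/* wrap the starting index */
	if (start + n <= q->allocated)
		memcpy(dest, q->block + start, n * sizeof(PyObject *));
	else {
		/* the slice spans the wrap point:  two copies */
		int n1 = q->allocated - start;
		memcpy(dest, q->block + start, n1 * sizeof(PyObject *));
		memcpy(dest + n1, q->block, (n - n1) * sizeof(PyObject *));
	}
}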

>> It would have the advantage that resizing is never needed until
>> you're wholly out of room, but that seems minor compared to the
>> drawbacks.

Just thought I'd repeat that <wink>.



