Re: [Python-Dev] collections module

None of those can compete with my proposed C implementation which pre-allocates around 50 cells at a time and does its reads/writes through array lookups. It avoids all the expensive resizing of lists and uses C ints instead of PyInts.
That's pretty crazy. If I had a say, I'd be +1 just because I've seen so many people using their own {append, pop(0)} queues before.

- Josiah

P.S. I know Knuth had it first, but one never knows how well read anyone else is. It is better to err on the side of giving too much information than too little.
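(For reference, the idiom in question is roughly the following -- a minimal sketch; the cost hides in pop(0), which shifts every remaining element left:)

    # The common hand-rolled queue idiom: O(1) put, O(n) get.
    queue = []
    queue.append('job1')      # put at the tail
    queue.append('job2')
    first = queue.pop(0)      # get from the head -- shifts the whole list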

Josiah Carlson <jcarlson@uci.edu> writes:
None of those can compete with my proposed C implementation which pre-allocates around 50 cells at a time and does its reads/writes through array lookups. It avoids all the expensive resizing of lists and uses C ints instead of PyInts.
That's pretty crazy. If I had a say, I'd be +1 just because I've seen so many people using their own {append, pop(0)} queues before.
As a user of Python, I am +1 on making there be a *really really fast* queue data structure somewhere, but -0 on its being something other than the existing Queue class. Can't we just optimize the one we've got? Ideally without losing the thread safety guarantees (but making it usable in an unthreaded program would be nice)? zw

[Raymond]
None of those can compete with my proposed C implementation which pre-allocates around 50 cells at a time and does its reads/writes through array lookups. It avoids all the expensive resizing of lists and uses C ints instead of PyInts.
Josiah Carlson <jcarlson@uci.edu> writes:
That's pretty crazy. If I had a say, I'd be +1 just because I've seen so many people using their own {append, pop(0)} queues before.
[Zack]
As a user of Python, I am +1 on making there be a *really really fast* queue data structure somewhere, but -0 on its being something other than the existing Queue class. Can't we just optimize the one we've got? Ideally without losing the thread safety guarantees (but making it usable in an unthreaded program would be nice)?
You miss one thing -- the Queue class exists *specifically* for thread communication. It has very different semantics than I'd want for a generic queue. E.g. with Queue, if you try to get something out of an empty queue, your thread *blocks*. That's only appropriate for applications that expect this. (Imagine the newbie questions to c.l.py. :-) My own ideal solution would be a subtle hack to the list implementation to make pop(0) and insert(0, x) go a lot faster -- this can be done by adding one or two extra fields to the list head and allowing empty space at the front of the list as well as at the end. --Guido van Rossum (home page: http://www.python.org/~guido/)

You miss one thing -- the Queue class exists *specifically* for thread communication. It has very different semantics than I'd want for a generic queue. E.g. with Queue, if you try to get something out of an empty queue, your thread *blocks*. That's only appropriate for applications that expect this. (Imagine the newbie questions to c.l.py. :-)
Agreed. About the only question is what the new fifo queue should be called, and whether it should be located in Queue (probably not, given the module description), Raymond's proposed high-speed data structure module, or somewhere else.
My own ideal solution would be a subtle hack to the list implementation to make pop(0) and insert(0, x) go a lot faster -- this can be done by adding one or two extra fields to the list head and allowing empty space at the front of the list as well as at the end.
I'm sure you know this, but just for sake of argument, how many extra fields? A couple? A few? 20? 30? I'm not sure there really is a good hard and fast number. I think it makes as much sense to preallocate the same number of entries to the front of a list, as to what is currently allocated to the back. At that point, you can guarantee O(1) {pop(0), pop(), append() and insert(0)} behavior, at the cost of slightly more memory. Then it really wouldn't matter how people implement their stacks or queues (front, back, double-list), it would still be fast. - Josiah

[Guido]
My own ideal solution would be a subtle hack to the list implementation to make pop(0) and insert(0, x) go a lot faster -- this can be done by adding one or two extra fields to the list head and allowing empty space at the front of the list as well as at the end.
[Josiah Carlson]
I'm sure you know this, but just for sake of argument, how many extra fields? A couple? A few? 20? 30?
As Guido said, "one or two" would be enough. I think you're talking about different things, though: Guido is talking about the number of new named struct members that would have to be added to Python's list object header. I think you're talking about how many array slots to leave unused on both ends. Just as now, on a reallocation the number of free slots asked for has to be proportional to the total number of slots, so that growth via a huge number of one-at-a-time appends takes (amortized) constant-time per append. Picking any fixed number of free slots leads to overall quadratic-time (in the number of appends) behavior.
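(A rough cost model makes the difference vivid -- a sketch only, counting the pointer copies forced by growth, and assuming every reallocation copies all existing elements:)

    # Count pointer copies caused by reallocation while appending n
    # items, for a fixed pad vs. a proportional pad.
    def copies(n, grow):
        copied = allocated = 0
        for size in xrange(1, n + 1):
            if size > allocated:
                copied += size - 1               # move the existing elements
                allocated = grow(size)
        return copied

    n = 100000
    print copies(n, lambda size: size + 50)           # fixed pad: quadratic
    print copies(n, lambda size: size + (size >> 3))  # proportional: O(n)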

My own ideal solution would be a subtle hack to the list implementation to make pop(0) and insert(0, x) go a lot faster -- this can be done by adding one or two extra fields to the list head and allowing empty space at the front of the list as well as at the end.
I'm sure you know this, but just for sake of argument, how many extra fields?
You misunderstood -- the *fields* I was talking about are counters or pointers that keep track of the overallocation, and 1 or at most 2 is enough.
A couple? A few? 20? 30? I'm not sure there really is a good hard and fast number.
You're talking about the amount of overallocation. I would propose to overallocate exactly as much as the list implementation currently does, which is a very subtle scheme that overallocates more when the list grows larger to maintain logarithmic behavior.
I think it makes as much sense to preallocate the same number of entries to the front of a list, as to what is currently allocated to the back. At that point, you can guarantee O(1) {pop(0), pop(), append() and insert(0)} behavior, at the cost of slightly more memory. Then it really wouldn't matter how people implement their stacks or queues (front, back, double-list), it would still be fast.
Right. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido]
You misunderstood -- the *fields* I was talking about are counters or pointers that keep track of the overallocation, and 1 or at most 2 is enough.
One would be enough. To avoid breaking huge mounds of code all over the place, I think the existing ob_item should "float", to point always to the current element "at index 0". So ob_item would change when inserting or deleting "at the left end". The only new thing you need then is a field recording the number of unused "left-side" slots, between the start of the memory actually allocated and the slot ob_item points at; that field would have to be added to or subtracted from ob_item in the obvious ways when allocating or deallocating the raw memory, but *most* of listobject.c could ignore it, and none of the PyList macros would need to change (in particular, the speed of indexed access wouldn't change).

The overallocation strategy gets more complicated, though, and less effective for prepending. If you run out of room when prepending, it's not enough just to ask realloc() for more memory, you also have to memmove() the entire vector, to "make room at the left end". In previous lives I've found that "appending at the right" is as efficient as it is in practice largely because most non-trivial C realloc() calls end up extending the vector in-place, not needing to copy anything. The O() behavior is the same if each boost in size has to move everything that already existed, but the constant factor goes up more than just measurably <wink>.
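(A Python model of the scheme may help -- a sketch only; the real change would be in C, and the names here are invented. The one new field is 'left', the count of free slots before the floating item pointer:)

    # Model of a list whose "ob_item" floats inside the allocated block.
    class FloatingList:
        def __init__(self):
            self.block = [None] * 8    # the allocated slots
            self.left = 4              # free slots left of "ob_item"
            self.size = 0              # ob_size
        def append(self, x):
            if self.left + self.size == len(self.block):
                self.block.extend([None] * len(self.block))  # grow right
            self.block[self.left + self.size] = x
            self.size += 1
        def popleft(self):             # the newly cheap pop(0)
            x = self.block[self.left]
            self.block[self.left] = None
            self.left += 1             # ob_item floats right: O(1)
            self.size -= 1
            return x
        def prepend(self, x):          # insert(0, x)
            if self.left == 0:         # out of room on the left
                pad = len(self.block)
                self.block[:0] = [None] * pad
                self.left = pad
            self.left -= 1
            self.block[self.left] = x
            self.size += 1

(Bounds checks are omitted; the branch in prepend is exactly the memmove-the-whole-vector cost described above.)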

"Tim Peters" <tim.one@comcast.net> writes:
[Guido]
You misunderstood -- the *fields* I was talking about are counters or pointers that keep track of the overallocation, and 1 or at most 2 is enough.
One would be enough. To avoid breaking huge mounds of code all over the place, I think the existing ob_item should "float", to point always to the current element "at index 0". So ob_item would change when inserting or deleting "at the left end". The only new thing you need then is a field recording the number of unused "left-side" slots, between the start of the memory actually allocated and the slot ob_item points at; that field would have to be added to or subtracted from ob_item in the obvious ways when allocating or deallocating the raw memory, but *most* of listobject.c could ignore it, and none of the PyList macros would need to change (in particular, the speed of indexed access wouldn't change).
The overallocation strategy gets more complicated, though, and less effective for prepending. If you run out of room when prepending, it's not enough just to ask realloc() for more memory, you also have to memmove() the entire vector, to "make room at the left end". In previous lives I've found that "appending at the right" is as efficient as it is in practice largely because most non-trivial C realloc() calls end up extending the vector in-place, not needing to copy anything. The O() behavior is the same if each boost in size has to move everything that already existed, but the constant factor goes up more than just measurably <wink>.
Aren't array-based implementations of queue usually based on circular buffers? Head and tail pointers plus a count are sufficient. Then a push is: check count (resize if necessary), write, adjust head pointer, adjust count. A pull is similar, using the tail pointer. Constant time. The resize is a little tricky because the gap between the tail and head pointers is what needs to grow.

The current implementation of Queue (with what appears to me to be a listpop(v, 0) for get()) involves an internal shift of the whole pointer vector, O(n), for every read of the queue. The Cookbook examples mentioned alleviate this, but still with excess copying or unbounded memory usage. IMHO it's better done in listobject. I'm not up to speed on the reasons why ob_item needs to float, but perhaps there is a straightforward indirect relationship to the head pointer.

Why not have listobject grow deque functionality using circular buffer techniques? That way you get high performance queues and deques while avoiding the name collision with Queue.

    put(x)      == insert(0, x)
    get()       == pop()
    put_tail(x) == append(x) == push(x)  (?? this must have been discussed)
    get_head()  == pop(0)

All list operations would preserve the head and tail pointers.

-- KBK
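(For concreteness, the kind of ring buffer being proposed, sketched in Python -- the names are invented, and the real thing would live in listobject.c; resizing copies the live items into a larger block, which is the "widen the gap" step discussed later:)

    # Circular-buffer queue: head indexes the next item to get,
    # count is the number of live items.
    class RingQueue:
        def __init__(self, room=8):
            self.block = [None] * room
            self.head = 0
            self.count = 0
        def put(self, x):                     # add at the tail
            if self.count == len(self.block):
                self._grow()
            tail = (self.head + self.count) % len(self.block)
            self.block[tail] = x
            self.count += 1
        def get(self):                        # remove from the head
            if not self.count:
                raise IndexError('get from empty queue')
            x = self.block[self.head]
            self.block[self.head] = None
            self.head = (self.head + 1) % len(self.block)
            self.count -= 1
            return x
        def _grow(self):
            # Copy live items, in order, into a block twice the size.
            new = [None] * (2 * len(self.block))
            for i in xrange(self.count):
                new[i] = self.block[(self.head + i) % len(self.block)]
            self.block, self.head = new, 0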

[Kurt B. Kaiser]
Aren't array-based implementations of queue usually based on circular buffers?
Yes, but those are generally bounded queues.
Head and tail pointers plus a count are sufficient. Then a push is: check count (resize if necessary), write, adjust head pointer, adjust count.
By analogy to a real-life queue (of, say, people), pushing is usually viewed as adding to the tail (when you get in line for movie tickets, you're likely to get punched if you add yourself to the head of the queue). I have to insist on using that terminology, just because it ties my head in knots to swap the usual meanings.
A pull is similar, using the tail pointer. Constant time.
So is the speculative Python list gimmick we're talking about, until it runs out of room. We use mildly exponential overallocation when growing a list to ensure amortized O(1) behavior regardless.
The resize is a little tricky because the gap between the tail and head pointers is what needs to grow.
Resizing is easier with the speculative Python list gimmick, and 0-based indexing (where list[0] is the current head) is identical to the list-indexing implementation Python already has. The latter is important because it's fast. Needing to "wrap around" an index to account for circularity would introduce new test+branch expense into indexing.
The current implementation of Queue (with what appears to me to be a listpop(v, 0) for get()) involves an internal shift of the whole pointer vector, O(n), for every read of the queue. The Cookbook examples mentioned alleviate this, but still with excess copying or unbounded memory usage. IMHO it's better done in listobject.
Queue.py and the speculative Python list gimmick are different issues. The overhead of mucking with locks, and Python-level method calls, in Queue.py surely overwhelms whatever savings you could get over the two-list queue implementation Josiah put in his Cookbook comment. The only "extra" expense there over a straight push-and-pop-both-at-the-right-end Python list is that each element added to the queue is involved in (no more than) one C-level pointer swap (due to the reverse() call, when the input list is turned into the output list). If the speculative Python list gimmick were implemented, Queue.py could, of course, change to use it.
I'm not up to speed on the reasons why ob_item needs to float
So that indexing a Python list is as fast as it is today, and so that a mountain of C code continues to work (all C code mucking with Python lists now believes that a list contains ob_size elements, starting at ob_item and ending at ob_item+ob_size-1).
but perhaps there is a straightforward indirect relationship to the head pointer.
ob_item would *be* the head pointer (for my meaning for "head"), whenever the list is non-empty.
Why not have listobject grow deque functionality using circular buffer techniques?
As above, I don't see much to recommend that over the speculative Python list gimmick: indexing is slower and more complicated, resizing is more complicated, and all the listobject.c internals would get more complicated by needing to deal with that the list contents are no longer necessarily in a contiguous slice of a C vector (from the C POV, the list contents could be in two disjoint contiguous slices). It would have the advantage that resizing is never needed until you're wholly out of room, but that seems minor compared to the drawbacks.
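(For readers who haven't seen it, the two-list trick referred to throughout this thread goes roughly like this -- a sketch of the idea, not Josiah's exact Cookbook code:)

    # Two-list queue: puts append to one list; gets pop from the end
    # of the other; when the output side runs dry, the input side is
    # reversed and swapped in.  Each element takes part in at most
    # one reverse, so both operations are amortized O(1).
    class TwoListQueue:
        def __init__(self):
            self.front = []    # popped from the end (reversed order)
            self.back = []     # appended to in arrival order
        def put(self, x):
            self.back.append(x)
        def get(self):
            if not self.front:
                self.back.reverse()
                self.front, self.back = self.back, []
            return self.front.pop()    # IndexError when wholly empty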

"Tim Peters" <tim.one@comcast.net> writes:
[Kurt B. Kaiser]
Aren't array-based implementations of queue usually based on circular buffers?
Yes, but those are generally bounded queues.
And Pythonic versions would not be :-)
Head and tail pointers plus a count are sufficient. Then a push is: check count (resize if necessary), write, adjust head pointer, adjust count.
By analogy to a real-life queue (of, say, people), pushing is usually viewed as adding to the tail (when you get in line for movie tickets, you're likely to get punched if you add yourself to the head of the queue). I have to insist on using that terminology, just because it ties my head in knots to swap the usual meanings.
Reading other posts in the thread and further reflection had already converted me to that point of view. We are agreed! Put to the tail and get from the head. Also, agreed that ob_item _is_ the headpointer.
A pull is similar, using the tail pointer. Constant time.
So is the speculative Python list gimmick we're talking about, until it runs out of room. We use mildly exponential overallocation when growing a list to ensure amortized O(1) behavior regardless.
The whole game lies in what happens when you resize: how much data are you going to copy?
The resize is a little tricky because the gap between the tail and head pointers is what needs to grow.
Resizing is easier with the speculative Python list gimmick, and 0-based indexing (where list[0] is the current head) is identical to the list-indexing implementation Python already has. The latter is important because it's fast. Needing to "wrap around" an index to account for circularity would introduce new test+branch expense into indexing.
If I understand the gimmick, as the queue is used the ob_item/headpointer will walk down the block and items will be appended at the tail until the tail collides with the end of the block. At that point the gimmick approach will memmove the queue to place ob_item at the beginning of the block. Since the block is only a little longer than the queue, that's quite a bit of copying, unless I'm missing something.

With a circular buffer approach, the headpointer and tailpointer will walk down the block and when the end is reached the tailpointer will wrap around and "appended" items will be stored at the low end of the block. The process continues without any copying. The position of the pointers is calculated modulo the blocksize, so no test/branch is involved. If there is some space at the beginning of the block which has been freed up by pop(0), the list will automatically wrap around to fill it, even when the pop(0) has been done in a non-queue context.

With a circular buffer, upon resize the gap between tail and head is widened by the amount added to the block. With a queue, that would on average amount to copying half the data, but with a known stride. One could take a conservative approach and rotate the buffer every resize so that ob_item is at the beginning of the block, but doing that seems to be unnecessary. The question of which way to rotate the buffer into the new free space seems to be irrelevant in the case of a queue/deque, but may be interesting for ordinary lists. I suspect pushing the tail to the left is optimum to minimize copying.

[...]
Why not have listobject grow deque functionality using circular buffer techniques?
As above, I don't see much to recommend that over the speculative Python list gimmick: indexing is slower and more complicated,
Pointer arithmetic is cheap compared to memory access.
resizing is more complicated,
I'm not sure that's the case. And the gimmick approach has a 'complication' when the tail hits the end of the block even when the queue length is constant.
and all the listobject.c internals would get more complicated by needing to deal with that the list contents are no longer necessarily in a contiguous slice of a C vector (from the C POV, the list contents could be in two disjoint contiguous slices).
Not if modulo pointer arithmetic is used.
It would have the advantage that resizing is never needed until you're wholly out of room, but that seems minor compared to the drawbacks.
-- KBK

Kurt B. Kaiser wrote:
With a circular buffer approach, the headpointer and tailpointer will walk down the block and when the end is reached the tailpointer will wrap around and "appended" items will be stored at the low end of the block. The process continues without any copying.
I think you two are talking about completely different things. Tim is talking about a modification to the builtin list type. You are talking about a new data type for queues.

You are both right: Changing the list type to leave space in the front would make it more efficient for implementing queues than the current list type, especially if people use the standard idiom. Using a new type would be even more efficient if the queue is bounded.

I'm still opposed to adding a new queue type, because it complicates the library, and it does not give any benefit to existing code - one actually has to change the existing implementations of queues. People don't want to change their code, especially if they need to support old Python versions. So changing the list type looks more promising to me. Not taking any action might be acceptable as well.

Regards,
Martin

"Martin v. Loewis" <martin@v.loewis.de> writes:
I think you two are talking about completely different things.
Tim is talking about a modification to the builtin list type.
You are talking about a new data type for queues.
I am talking about a modification to listobject.c, also. The discussion is related to implementation and associated efficiencies. I'm -1 on adding a new queue datatype. I'm +1 on adding queue/deque functionality to the list datatype.

-- KBK

Kurt B. Kaiser wrote:
I am talking about a modification to listobject.c, also.
So you suggest that the list type become circular??? That will cause a major ABI incompatibility, as all modules that use PyList_GET_SIZE need to be recompiled. In addition, it causes a slow-down for the standard use of lists. So you would make the general case slower, just to support a special case. Why are you proposing that?

Regards,
Martin

[Kurt B. Kaiser]
I am talking about a modification to listobject.c, also.
[Martin v. Loewis]
So you suggest that the list type become circular???
That's my understanding, but already belabored at length <wink>.
That will cause a major ABI incompatibility, as all modules that use PyList_GET_SIZE need to be recompiled.
Well, the macros aren't supposed to be used outside the core -- external extensions are supposed to be using the representation-hiding PyList_Size, PyList_GetItem, PyList_SetItem functions. If we keep it contiguous, users of the macros may need to be recompiled anyway, since the layout of the list header would change.
In addition, it causes a slow-down for the standard use of lists.
That's one of my two major objections. The other is the enormous amount of code changes that would be required throughout listobject.c -- all "the usual" C idioms for mucking with vectors would no longer work correctly as-is, whether it's the typical "for (p = list->ob_item; p < fence; ++p)" loop, or use of library functions like memmove and memcpy. That's a ton of additional complication, with obvious bad implications for transparency, stability, maintainability, code bloat, and efficiency.
So you would make the general case slower, just to support a special case. Why are you proposing that?
Kurt's worried about the expense of memmove(), but not about the expense of integer mod. To be fair(er <wink>), I'm not convinced that anyone has dreamt up *detailed* suggestions sufficient to allow common queue access patterns to be reasonably zippy. Think about this one:

    q = [incoming() for i in range(1000)]
    while True:
        q.append(incoming())
        consume(q.pop(0))

That's just a queue in steady state, with a backlog of 1000 but thereafter with balanced entry and exit rates. We currently realloc() q on every append(), rounding up its size, where "its size" is taken from ob_size. That would no longer be *correct*: ob_size stays steady at 1000, but append needs ever-larger indices relative to the start of the allocated memory. IOW, while ob_size bears a useful relationship to the amount of memory we actually allocate today, that relationship would no longer hold (ob_size would become an arbitrarily poor lower bound). So what are we going to do instead, exactly?

We actually allocate enough room for 1024 elements now (although Python doesn't currently remember that). If we start to remember when we're out of room, and memmove() the whole blob left 24 slots when append hits the end of available room, then every 24 (+-1 -- doesn't matter at this level) iterations of the loop, we end up memmove'ing 1000 pointers. If the steady-state backlog were, say, 1023 elements instead, we'd end up memmove'ing 1023 pointers on *every* loop iteration, and each append becomes a linear-time operation. OTOH, if we "round up the size" based on the number of allocated (not used) slots, then we get back amortized O(1) time per append -- but the amount of allocated memory needed by the example grows without bound, despite that the number of used slots remains constant.

No "pure" strategy I've seen seems good enough. A circular implementation has no problem with the example, or, indeed, with any queue access pattern, although a circular implementation would be reluctant to reduce the amount of allocated space when the number of used slots decreases (since the unused slots are "a hole in the middle" of the allocated slots, and we can't give back a hole to the system malloc).

"Martin v. Loewis" <martin@v.loewis.de> writes:
That will cause a major ABI incompatibility, as all modules that use PyList_GET_SIZE need to be recompiled. In addition, it causes a slow-down for the standard use of lists. So you would make the general case slower, just to support a special case. Why are you proposing that?
I'm not proposing it, I'd have to provide an implementation to do that :-) I just asked the question, "Why not a circular buffer implementation?" Tim has been patient enough to get my education on that subject started.

ob_size should continue to be the number of pointers to objects. The GET_ITEM/SET_ITEM macros are more problematic in terms of overhead because an additional memory access would be required to get the current size of the block containing the pointer vector. As Tim said, other methods under consideration could also change the ABI.

If the overall slowdown of lists to get fast queues was 5%, that would no doubt be unacceptable. If it was 0.5%, I imagine it would be acceptable? Note that if no pop(0)'s are ever done on the list there is no wraparound, and the slowdown is limited to the access of blocksize plus a test; no modulo arithmetic is necessary. This is also the case if the list doesn't grow and items are only removed, including list[0]. And in the latter case, the memmoves associated with internal deletions will dominate the performance in all implementations.

-- KBK

[Kurt B. Kaiser]
... If the overall slowdown of lists to get fast queues was 5%, that would no doubt be unacceptable. If it was 0.5%, I imagine it would be acceptable?
Probably not -- it's too much the tail wagging the dog, as Python lists, used for what they're best at, are core language workhorses already. Timing changes under 10% are very hard to establish convincingly, though, especially because the relative performance of Python features varies so widely across platforms (some mix of the code the platform C compiler generates, the quality of the platform C libraries, and pure accidents, like whether the top of the main eval loop happens to end up fighting severe I-cache conflicts with other high-frequency code).
Note that if no pop(0)'s are ever done on the list there is no wraparound, and the slowdown is limited to the access of blocksize plus a test; no modulo arithmetic is necessary. This is also the case if the list doesn't grow and items are only removed, including list[0]. And in the latter case, the memmoves associated with internal deletions will dominate the performance in all implementations.
Neither case holds for a queue; the question that started this was whether to provide a fancy queue implementation in C, and (of course) Guido would rather see Python lists efficiently usable for that purpose too. I suspected that might not be a realistic hope, and now I'm starting to *think* it too <wink>.

[Kurt B. Kaiser]
Note that if no pop(0)'s are ever done on the list there is no wraparound, and the slowdown is limited to the access of blocksize plus a test; no modulo arithmetic is necessary. This is also the case if the list doesn't grow and items are only removed, including list[0]. And in the latter case, the memmoves associated with internal deletions will dominate the performance in all implementations.
[Timbot]
Neither case holds for a queue; the question that started this was whether to provide a fancy queue implementation in C, and (of course) Guido would rather see Python lists efficiently usable for that purpose too. I suspected that might not be a realistic hope, and now I'm starting to *think* it too <wink>.
It needn't be especially fancy. Here is what I had in mind:

    n = 50

    class fifo(object):

        def __init__(self):
            # block[0:n] is for data and block[n] is a link field
            self.head = self.tail = [None] * (n+1)
            self.headptr = self.tailptr = 0

        def push(self, x):
            if self.headptr == n:
                newblock = [None] * (n+1)
                self.head[n] = newblock
                self.head = newblock
                self.headptr = 0
            self.head[self.headptr] = x
            self.headptr += 1

        def pop(self):
            if self.tail is self.head and self.tailptr >= self.headptr:
                raise IndexError
            if self.tailptr == n:
                self.tail = self.tail[n]
                self.tailptr = 0
            x = self.tail[self.tailptr]
            self.tailptr += 1
            return x

The memory manager is called once for every 50 queue insertions, and memory is freed after every 50 pops. Outside of that, every push and pop is just a fast array access and pointer increment. Long queues incur about a 15-20% space overhead as compared to a straight-list implementation. Speedwise, this beats the socks off of the current approach.

Raymond Hettinger
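(A quick driver, just to show the intended calling pattern -- hypothetical usage, assuming the class above:)

    q = fifo()
    for i in xrange(130):      # spans three 50-slot blocks
        q.push(i)
    print q.pop(), q.pop()     # prints: 0 1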

[Tim]
Neither case holds for a queue; the question that started this was whether to provide a fancy queue implementation in C, and (of course) Guido would rather see Python lists efficiently usable for that purpose too. I suspected that might not be a realistic hope, and now I'm starting to *think* it too <wink>.
[Raymond Hettinger]
It needn't be especially fancy. Here is what I had in mind:
    n = 50

    class fifo(object):
This is a different issue. There are many ways to *use* lists to build a distinct fifo class. What I've been writing about is the possibility of making Python lists *directly* efficient for use as queues (the realistic possibility of making list.append and list.pop(0) both at worst O(1) amortized time). That's what I remain skeptical about.
        def __init__(self):
            # block[0:n] is for data and block[n] is a link field
            self.head = self.tail = [None] * (n+1)
That *looks* so much like a common newbie coding error that it deserves a comment.
            self.headptr = self.tailptr = 0
        def push(self, x):
            if self.headptr == n:
                newblock = [None] * (n+1)
                self.head[n] = newblock
                self.head = newblock
                self.headptr = 0
            self.head[self.headptr] = x
            self.headptr += 1
Please don't push at "the head" -- this reverses the common (and sensible) meanings of "head" and "tail" for queues. You join a queue at its tail, and leave the queue when you reach its head.
        def pop(self):
            if self.tail is self.head and self.tailptr >= self.headptr:
                raise IndexError
            if self.tailptr == n:
                self.tail = self.tail[n]
                self.tailptr = 0
            x = self.tail[self.tailptr]
            self.tailptr += 1
            return x
The memory manager is called once for every 50 queue insertions, and memory is freed after every 50 pops. Outside of that, every push and pop is just a fast array access and pointer increment. Long queues incur about a 15-20% space overhead as compared to a straight-list implementation. Speedwise, this beats the socks off of the current approach.
Meaning pairing append() and pop(0)? Sure -- it would be hard not to beat that <wink>. You can save more by ensuring the blocks are small enough for pymalloc to satisfy the memory requests, more again by keeping a dedicated freelist of nodes, and more again by switching to a circular implementation when the queue appears to have reached steady-state. You might also look at the implementation of the Icon language, where the native list type is implemented similarly as a linked list of blocks, and supports efficient general deque operation.

Note that Marc-Andre's mxQueue implements a queue as a circular dynamic array; when the queue grows at an enormous rate, he has to endure some reallocs and moving, but in steady-state it can run forever without any memory-management operations. I'll note it looks like an mxQueue never shrinks, though, which may or may not be "a problem", depending on the app. As to "fancy", Marc-Andre ended up with over 1,000 lines of C, and his implementation takes a simpler approach than yours. OTOH, it added lots of marginal features (like "q << x" adds x to q).

I've never needed a queue in Python where "the standard" two-list trick (as shown in Josiah's Cookbook comment) wasn't more than good enough, so my interest in this is really limited to whether Python's native list type can (realistically) be made efficient for general deque operation.

The memory manager is called once for every 50 queue insertions, and memory is freed after every 50 pops. Outside of that, every push and pop is just a fast array access and pointer increment. Long queues incur about a 15-20% space overhead as compared to a straight-list implementation. Speedwise, this beats the socks off of the current approach.
Meaning pairing append() and pop(0)? Sure -- it would be hard not to beat that <wink>.
No, I meant that it is *much* faster to do an indexed write to a pre-allocated list than to use anything relying on PyList_Append() or ins1(). All of those individual calls to realloc are expensive. The 50-to-1 dilution of that expense is how it beats Knuth's two-stack trick. Guess which of these two statements is faster and by how much:

    list(itertools.repeat(None, 5000))
    list(tuple(itertools.repeat(None, 5000)))

The answer is important because it points the way to faster list comprehensions and other list-building applications.

I've never needed a queue in Python where "the standard" two-list trick (as shown in Josiah's Cookbook comment) wasn't more than good enough, so my interest in this is really limited to whether Python's native list type can (realistically) be made efficient for general deque operation.
And my interest is in a collections module, which should probably be published elsewhere (too much negativity on the subject here). I am frankly surprised that so many here think that a collections module would be bad for the language. Go figure. It's not like these are new, untried ideas -- the goal was supposed to be improving productivity by providing higher level tools. A fast underlying implementation was just icing on the cake.

Raymond Hettinger
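(The itertools puzzle above is easy to try on any box -- a hedged timing sketch; the plausible explanation, and presumably Raymond's point, is that the inner tuple() materializes a known length, so list() can size itself once instead of growing append-by-append:)

    # Timing sketch; absolute numbers are machine-dependent.
    import timeit
    for stmt in ("list(itertools.repeat(None, 5000))",
                 "list(tuple(itertools.repeat(None, 5000)))"):
        print stmt, timeit.Timer(stmt, "import itertools").timeit(1000)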

On Mon, Jan 12, 2004, Raymond Hettinger wrote:
I am frankly surprised that so many here think that a collections module would be bad for the language. Go figure. It's not like these are new, untried ideas -- the goal was supposed to be improving productivity by providing higher level tools. A fast underlying implementation was just icing on the cake.
I'm suspecting that most of the negativity came from the lack of use cases for bags. Let me suggest an alternate idea: in the past, some people have suggested a "data structures" module to give reference implementations of standard data structures and algorithms. This would be a tool aimed more at making it easy to learn computer science with Python than providing the highest-speed quirky implementations; the focus would be on documenting the code.

-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
A: No.
Q: Is top-posting okay?

[Raymond Hettinger]
No, I meant that it is *much* faster to do an indexed write to a pre-allocated list than to use anything relying on PyList_Append() or ins1(). All of those individual calls to realloc are expensive. The 50-to-1 dilution of that expense is how it beats Knuth's two-stack trick.
That's fine. As I said before, I've never had a use for a queue in Python where the two-stack trick (coded in Python, even) wasn't more than fast enough for me. If I did, there are a universe of possibilities, including Marc-Andre's mxQueue. That doesn't use Python lists; doesn't call realloc on every queue put (only when it's out of room, which is "never" when a queue reaches steady state); and, when it does realloc, overallocates by a much larger amount than Python's list type.
Guess which of these two statements is faster and by how much:
    list(itertools.repeat(None, 5000))
    list(tuple(itertools.repeat(None, 5000)))
I did guess, and was correct to 6 significant decimal digits <wink>.
The answer is important because it points the way to faster list comprehensions and other list building applications.
That Python's builtin list calls realloc() on every append is pretty gross, but it's also a tradeoff, saving the bytes in the list header that would be required to remember how many allocated-but-unused slots exist. If you've got a million tiny lists (and some apps do), burning an additional 8MB for something your app doesn't really need isn't attractive (on a 32-bit box today, the list header is 16 bytes; pymalloc aligns to 8-byte boundaries, so the next possible size is 24 bytes). list's very small overallocation (in the ballpark of 5%) is also aimed at minimizing memory burden. Speed isn't everything, tradeoffs are hard, and something loses when something else is favored. (BTW, I never would have considered implementing a resizable list type *without* keeping a field to record the amount of overallocation, so this isn't the tradeoff I would have made <wink>.)
... And my interest is in a collections module, which should probably be published elsewhere (too much negativity on the subject here).
A collections *package* is a fine idea. I say "package" instead of "module" because there are literally hundreds of container types, and queues and bags are a tiny corner of that space. An example dear to my heart is BTrees (I maintain Zope's BTree implementation), which is a wonderful way to get an associative mapping sorted by key value. BTrees have a large API all by themselves, so it wouldn't make sense to try to squash them into a module along with 50 other container types.
I am frankly surprised that so many here think that a collections module would be bad for the language.
I was more of the impression that bags didn't have fans <wink>.
Go figure. It's not like these are new, untried ideas -- the goal was supposed to be improving productivity by providing higher level tools. A fast underlying implementation was just icing on the cake.
I'm strongly in favor of a collections package.

Guess which of these two statements is faster and by how much:
    list(itertools.repeat(None, 5000))
    list(tuple(itertools.repeat(None, 5000)))
I did guess, and was correct to 6 significant decimal digits <wink>.
The answer is important because it points the way to faster list comprehensions and other list building applications.
That Python's builtin list calls realloc() on every append is pretty gross, but it's also a tradeoff, saving the bytes in the list header that would be required to remember how many allocated-but-unused slots exist. If you've got a million tiny lists (and some apps do), burning an additional 8MB for something your app doesn't really need isn't attractive (on a 32-bit box today, the list header is 16 bytes; pymalloc aligns to 8-byte boundaries, so the next possible size is 24 bytes). list's very small overallocation (in the ballpark of 5%) is also aimed at minimizing memory burden. Speed isn't everything, tradeoffs are hard, and something loses when something else is favored. (BTW, I never would have considered implementing a resizable list type *without* keeping a field to record the amount of overallocation, so this isn't the tradeoff I would have made <wink>.)
Just in case nobody had thought of it before, couldn't the realloc() call be avoided when roundupsize(oldsize) == roundupsize(newsize)???
... And my interest is in a collections module, which should probably be published elsewhere (too much negativity on the subject here).
A collections *package* is a fine idea. I say "package" instead of "module" because there are literally hundreds of container types, and queues and bags are a tiny corner of that space. An example dear to my heart is BTrees (I maintain Zope's BTree implementation), which is a wonderful way to get an associative mapping sorted by key value. BTrees have a large API all by themselves, so it wouldn't make sense to try to squash them into a module along with 50 other container types.
Right. +1 for a package. --Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido]
Just in case nobody had thought of it before, couldn't the realloc() call be avoided when roundupsize(oldsize) == roundupsize(newsize)???
I had thought of it before, but never acted on it <wink>. roundupsize() has gotten more expensive over the years too, and it may vary across boxes whether it's cheaper to call that a second time than it is to make the system realloc deduce that nothing needs to be done. roundupsize() is certainly cheaper than Microsoft's realloc() in the do-nothing case.

Note that there's an endcase glitch here when oldsize==0 and newsize==1. roundupsize() returns 8 for both, but it doesn't follow that ob_item isn't NULL. realloc(p, n) special-cases the snot out of p==NULL to avoid crashing then.

Note in passing that roundupsize() is expensive *because* lists don't remember the amount of overallocated space too: roundupsize() has to arrange to deliver the same result for many consecutive inputs, to prevent asking the platform for non-trivial reallocs() ("trivial realloc" == "do-nothing realloc" == realloc(p, n) called with fixed p and n multiple successive times). If we knew the point at which the "excess" slots were exhausted, then, e.g., doing realloc(oldsize + (oldsize >> 4)) at that point would give similar behavior with a much cheaper "new size" function (if we only computed a new size when a new size was actually needed, the "new size" function wouldn't have to arrange to remain constant over (eventually) unboundedly long stretches of consecutive inputs).
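(For readers without the source handy, the rounding is schematically like this -- a Python paraphrase of the 2.3-era C function, so treat the details as approximate:)

    # Round n up to a multiple of a power of two that grows with n
    # (8 while n < 256, 64 while n < 2048, and so on), so that long
    # runs of consecutive inputs share one answer.
    def roundupsize(n):
        nbits = 0
        n2 = n >> 5
        while True:
            n2 >>= 3
            nbits += 3
            if not n2:
                break
        return ((n >> nbits) + 1) << nbits

    print roundupsize(0), roundupsize(1)        # both 8, per the endcase above
    print roundupsize(1000), roundupsize(1001)  # same bucket: 1024 1024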

[Guido]
Just in case nobody had thought of it before, couldn't the realloc() call be avoided when roundupsize(oldsize) == roundupsize(newsize)???
Followup: it's not quite that easy, because (at least) PyList_New(int size) can create a list whose allocated size hasn't been rounded up. If I "fix" that, then it's possible to get away with just one roundupsize() call on most list.insert() calls (including list.append()), while avoiding the realloc() call too. Patch attached. I timed it like so:

    def punchit():
        from time import clock as now
        a = []
        push = a.append
        start = now()
        for i in xrange(1000000):
            push(i)
        finish = now()
        return finish - start

    for i in range(10):
        print punchit()

and got these elapsed times (this is Windows, so this is elapsed wall-clock time):

    before          after
    -------------   --------------
    1.05227710823   1.02203188119
    1.05532442716   0.569660068053
    1.05461539751   0.568627533147
    1.0564449622    0.569336562799
    1.05964146231   0.573247959235
    1.05679528655   0.568503494862
    1.05569402772   0.569745553898
    1.05383177727   0.569499991619
    1.05528000805   0.569163914916
    1.05636618113   0.570356526258

So pretty consistent here, apart from the glitch on the first "after" run, and about an 80% speedup. Yawn <wink>.

If someone wants to run with this, be my guest. There may be other holes (c.f. PyList_New above), and it could sure use a comment about why it's correct (if it in fact is <0.9 wink>).

[Guido]
Just in case nobody had thought of it before, couldn't the realloc() call be avoided when roundupsize(oldsize) == roundupsize(newsize)???
[Tim]
Followup: it's not quite that easy, because (at least) PyList_New(int size) can create a list whose allocated size hasn't been rounded up.
If I "fix" that, then it's possible to get away with just one roundupsize() call on most list.insert() calls (including list.append()), while avoiding the realloc() call too. Patch attached.
I timed it like so:
    def punchit():
        from time import clock as now
        a = []
        push = a.append
        start = now()
        for i in xrange(1000000):
            push(i)
        finish = now()
        return finish - start
    for i in range(10):
        print punchit()
and got these elapsed times (this is Windows, so this is elapsed wall-clock time):
    before          after
    -------------   --------------
    1.05227710823   1.02203188119
    1.05532442716   0.569660068053
    1.05461539751   0.568627533147
    1.0564449622    0.569336562799
    1.05964146231   0.573247959235
    1.05679528655   0.568503494862
    1.05569402772   0.569745553898
    1.05383177727   0.569499991619
    1.05528000805   0.569163914916
    1.05636618113   0.570356526258
So pretty consistent here, apart from the glitch on the first "after" run, and about an 80% speedup.
Yawn <wink>.
If someone wants to run with this, be my guest. There may be other holes (c.f. PyList_New above), and it could sure use a comment about why it's correct (if it in fact is <0.9 wink>).
Awesome! We need timing results on some other platforms (Linux, FreeBSD, MacOSX). Volunteers? --Guido van Rossum (home page: http://www.python.org/~guido/)

Tim's optimization seems to help even more on Linux than Windows. On my box each iteration of the test program takes 0.80s without the patch and 0.27 with it. jw

[John Williams]
Tim's optimization seems to help even more on Linux than Windows. On my box each iteration of the test program takes 0.80s without the patch and 0.27 with it.
Wow! That's better than I had hoped for. I'll note one subtlety: the call to realloc() probably endures mutex locking overhead to keep it threadsafe, even if the realloc() is (in the end) a nop. It certainly does under Microsoft's realloc(), but the native locking gimmicks on Windows are very fast (Windows loves threads).

To repeat findings on a Linux machine (redhat9, x86, 2.4GHz):

    Old     New
    0.74    0.22
    0.74    0.22
    0.74    0.22
    0.74    0.22
    0.72    0.21
    0.76    0.22
    0.74    0.22
    0.77    0.22
    0.75    0.22
    0.74    0.22

I also got the same (?) failure as Tim described:

    test_list
    make: *** [test] Segmentation fault

Jeff

[Jeff Epler]
To repeat findings on a Linux machine (redhat9, x86, 2.4GHz):

    Old     New
    0.74    0.22
    0.74    0.22
    0.74    0.22
    0.74    0.22
    0.72    0.21
    0.76    0.22
    0.74    0.22
    0.77    0.22
    0.75    0.22
    0.74    0.22
I also got the same (?) failure as Tim described:

    test_list
    make: *** [test] Segmentation fault
Yup, that's the one. This line in listsort():

    self->ob_item = empty_ob_item = PyMem_NEW(PyObject *, 0);

violates the new assumption too; it can be replaced with, e.g.,

    self->ob_item = empty_ob_item =
        PyMem_NEW(PyObject *, roundupsize(0) * sizeof(PyObject*));

although that should really be checked for a NULL return too.

On Tue, Jan 13, 2004 at 01:53:31PM -0600, Jeff Epler wrote:
I also got the same (?) failure as tim described: test_list make: *** [test] Segmentation fault
test_sort (test.test_list.ListTest) ...

    ==32746==
    ==32746== Invalid write of size 4
    ==32746==    at 0x806A855: ins1 (listobject.c:175)
    ==32746==    by 0x806B144: listappend (listobject.c:637)
    ==32746==    by 0x80A5928: call_function (ceval.c:3440)
    ==32746==    by 0x80A3F16: eval_frame (ceval.c:2127)
    ==32746== Address 0x41302B2C is 0 bytes inside a block of size 1 alloc'd
    ==32746==    at 0x40160DB8: malloc (in /usr/lib/valgrind/valgrind.so)
    ==32746==    by 0x806B50E: listsort (listobject.c:1884)
    ==32746==    by 0x80EEF07: PyCFunction_Call (methodobject.c:118)
    ==32746==    by 0x805AAD3: PyObject_Call (abstract.c:1759)

It's in the self-modifying list test:

    z = range(12)
    def c(x, y):
        z.append(1)  # HERE
        return cmp(x, y)
    z.sort(c)

Jeff

Guido van Rossum <guido@python.org> writes:
Awesome! We need timing results on some other platforms (Linux, FreeBSD, MacOSX).
OpenBSD 3.3, i386 2.4GHz P4 Celeron, 256MB, 128kB cache

    before   after
    ------   -----
    1.2      0.64
    1.15     0.60
    1.16     0.60
    1.16     0.60
    1.16     0.61
    1.15     0.60
    1.13     0.61
    1.15     0.59
    1.14     0.62
    1.15     0.60

before: 15773 pystones, 29.9 mPB
after: 12690 pystones, 31.4 mPB

Switch back to CVS: 19531, 19379, 19531, 19379, 19305, 19084 pystones, 31.4 mPB

Switch back to Tim's patch: 15015, 15480, 14588 pystones

(Should not need it, but: make clean && make -- the previous runs used plain make, with only listobject.c being recompiled.)

Still patched: 13158, 14006, 13193; restart python; 15923, 16722, 16023; restart python; 12953, 13698, 12953, 13513

Well, pystones are all over the map on this box. But the improvement is there for punchit, and the parrotbench benchmark is solid across the change.

-- KBK

OpenBSD 3.3, i386 2.4GHz P4 Celeron, 256MB, 128kB cache
    before   after
    ------   -----
    1.2      0.64
    1.15     0.60
    1.16     0.60
    1.16     0.60
    1.16     0.61
    1.15     0.60
    1.13     0.61
    1.15     0.59
    1.14     0.62
    1.15     0.60
before: 15773 pystones, 29.9 mPB
after: 12690 pystones, 31.4 mPB
What's the mPB number? If it's the time to complete parrotbench, this is actually a slowdown, just as for pystone.
Switch back to CVS:
19531, 19379, 19531, 19379, 19305, 19084 pystones, 31.4 mPB
Switch back to Tim's patch:
15015, 15480, 14588 pystones
should not need, but: make clean && make (just make on previous, only listobject.c being recompiled)
Still patched: 13158, 14006, 13193; restart python; 15923, 16722, 16023 restart python; 12953, 13698, 12953, 13513
Well, pystones are all over the map on this box. But the improvement is there for punchit and the parrotbench benchmark is solid across the change.
Unclear. Please clarify mPB. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org> writes:
before: 15773 pystones, 29.9 mPB
after: 12690 pystones, 31.4 mPB
What's the mPB number? If it's the time to complete parrotbench, this is actually a slowdown, just as for pystone.
Switch back to CVS:
19531, 19379, 19531, 19379, 19305, 19084 pystones, 31.4 mPB
Switch back to Tim's patch:
15015, 15480, 14588 pystones
should not need, but: make clean && make (just make on previous, only listobject.c being recompiled)
Still patched: 13158, 14006, 13193; restart python; 15923, 16722, 16023 restart python; 12953, 13698, 12953, 13513
Well, pystones are all over the map on this box. But the improvement is there for punchit and the parrotbench benchmark is solid across the change.
Unclear. Please clarify mPB.
milliParrotBenches/sec: 1000/(time to complete parrotbench [sec]). In a few years we'll be spec'ing kPB :-)

So the data so far is

    29.9 mPB unpatched
    31.4 mPB patched
    31.4 mPB unpatched

This morning: 30.7, 30.4, 30.1, 32.1, 32.2, 31.9 mPB patched (using make time)

    rm Objects/listobject.c && cvs up Objects/listobject.c
    make clean && make

30.1, 30.2, 30.3, 32.1, 29.6, 29.9, 28.2 mPB unpatched

    Average unpatched: 30.2 mPB
    Average patched:   31.3 mPB

So a 3-4% improvement, maybe. Needs more data. I reported the equivalent of 31.1 mPB on 2Jan04, which was the best of several trials. The 'best of 3' pystones at that time was 17182.

Pystone
-------

My impression is that Pystone isn't a good benchmark for this box unless run many times over a long period with python restarts, perhaps a cron at night. It may be the 128kB cache, but the results are very variable. This morning, with patched listobject.c, it's

    15015, 15060, 14881; restart python;
    16891, 17182, 17361; restart python;
    14925, 14970, 14925, 14492, 14662, 14836, 15060, 15106, 15060;
    append sys.path and import a module: 13889, 13927, 14084;
    do some stuff, then: 14662, 14368, 14577; restart python;
    14836, 14881, 14836; restart python;
    15337, 15337, 15337, 15773, 15873, 15674, 15479, 15432, 14792

Following the switch back to unpatched mentioned above:

    18182, 18050, 18248, 18450, 18248, 18181; restart python;
    14881, 14970, 14705, 14836, 14749, 14970; restart python;
    16835, 16835, 16778, 17064, 17182, 16892; restart python;
    append sys.path, import module: 20000, 20408, 20080, 20808, 19763, 20000;
    do some stuff: 19305, 19762, 19607, 19084, 19455, 19455; restart python;
    17605, 17731, 17482, 17544, 17544, 17483

On these two sets:

    Average unpatched: 17865 pystones
    Average patched:   15136 pystones

A 15% reduction with the patch. This *may* be significant, but requires a lot more data here and investigation by others. On this box the pystone rating jumps between python restarts. I haven't calculated any more advanced statistics; IMHO with small data sets, especially jumpy, ragged ones like these, you can often get a better feel by just eyeballing it. The downside being, you tend to see what you want to see....

-- KBK

I did my own pybenching, and then wanked with a spreadsheet. Python-listopt seems a hair faster (~2%); I think the t-test says there's 90% confidence that the difference is statistically significant. I haven't had a chance to run the pie benchmark. redhat9, x86 2.4GHz, Python CVS.

              Python-cvs   Python-listopt
              29239.8      30303
              29940.1      30864.2
              29585.9      30487.8
              29585.8      29411.8
              29940.1      30487.8
              29411.8      30303
              29761.9      30303
              27027        30303
              30303        30487.8
              29069.8      27777.8
    Minimum   27027        27777.8
    Maximum   30303        30864.2
    Median    29585.85     30303
    Average   29386.52     30072.92
    Std.Dev   904.49       885.61
    T-test    10.4%        [=ttest(b2:b11; c2:c11; 2; 3)]

Here are the parrotbench results, using the "user" field from "make time" output. There's a clear speedup of nearly 1 second in the -listopt version. During parts of these tests my system was also running nice'd distcc jobs, but since user time was measured this shouldn't matter much. "real" times were slower for python-listopt (0.56 seconds difference of median values) and more scattered (1.52 std dev vs 0.32 std dev) for this reason.

              Python-cvs   Python-listopt
              18.32        18.39
              18.44        18.39
              18.83        17.53
              18.75        17.38
              18.63        17.37
              18.91        17.36
              18.55        17.63
              18.66        18.06
              18.68        17.38
              18.74        17.92
    Minimum   18.32        17.36
    Maximum   18.91        18.39
    Median    18.67        17.58
    Average   18.65        17.74
    Std.Dev   0.18         0.42
    T-test    0.00%        [=ttest(b2:b11; c2:c11; 2; 3)]

    Difference of median values: -0.91

On Wed, Jan 14, 2004 at 11:01:42AM -0600, Jeff Epler wrote:
Here are the parrotbench results, using the "user" field from "make [snip]
I tried to reproduce these results on a Fedora Core 1 machine (x86, 1GHz) but found that the unpatched version performed better by about .4%. I realized that the binary I used had threading disabled. I recompiled and measured the patched version as 1% slower.

Jeff

FC1, no threading, "user" time on parrotbench:

                  CVS      List Optimization
                  28.73    28.9
                  28.78    28.84
                  28.65    28.78
                  28.77    28.83
                  28.81    28.87
                  28.84    28.9
                  28.65    28.83
                  28.75    28.87
                  28.67    28.82
                  28.79    28.95
    Minimum       28.65    28.78
    Maximum       28.84    28.95
    Median        28.76    28.86
    Mean          28.74    28.86
    St.Dev        0.07     0.05
    T-test        0.05%
    Difference
    of means      0.11     (0.40%)

FC1, threading, "user" time on parrotbench:

                  CVS      List Optimization
                  29.08    29.32
                  29.03    29.24
                  29.52    29.37
                  29.2     29.31
                  29.5     29.82
                  29.12    29.81
                  29.12    29.69
                  29.1     29.3
                  28.99    29.67
                  29.11    29.12
    Minimum       28.99    29.12
    Maximum       29.52    29.82
    Median        29.12    29.35
    Mean          29.18    29.47
    St.Dev        0.18     0.26
    T-test        1.05%
    Difference
    of means      0.29     (0.99%)

On Jan 13, 2004, at 12:14 PM, Guido van Rossum wrote:
[Guido]
Just in case nobody had thought of it before, couldn't the realloc() call be avoided when roundupsize(oldsize) == roundupsize(newsize)???
[Tim]
Followup: it's not quite that easy, because (at least) PyList_New(int size) can create a list whose allocated size hasn't been rounded up.
If I "fix" that, then it's possible to get away with just one roundupsize() call on most list.insert() calls (including list.append()), while avoiding the realloc() call too. Patch attached.
I timed it like so:
    def punchit():
        from time import clock as now
        a = []
        push = a.append
        start = now()
        for i in xrange(1000000):
            push(i)
        finish = now()
        return finish - start
    for i in range(10):
        print punchit()
and got these elapsed times (this is Windows, so this is elapsed wall-clock time):
    before          after
    -------------   --------------
    1.05227710823   1.02203188119
    1.05532442716   0.569660068053
    1.05461539751   0.568627533147
    1.0564449622    0.569336562799
    1.05964146231   0.573247959235
    1.05679528655   0.568503494862
    1.05569402772   0.569745553898
    1.05383177727   0.569499991619
    1.05528000805   0.569163914916
    1.05636618113   0.570356526258
So pretty consistent here, apart from the glitch on the first "after" run, and about an 80% speedup.
Yawn <wink>.
If someone wants to run with this, be my guest. There may be other holes (c.f. PyList_New above), and it could sure use a comment about why it's correct (if it in fact is <0.9 wink>).
Awesome! We need timing results on some other platforms (Linux, FreeBSD, MacOSX).
Volunteers?
Mac OS X 10.3.2, gcc (GCC) 3.3 20030304 (Apple Computer, Inc. build 1495), dual G5 2GHz w/ 1GB RAM, ./configure --disable-framework

    before   after
    0.38     0.29
    0.36     0.26
    0.36     0.26
    0.36     0.27
    0.36     0.26
    0.36     0.26
    0.36     0.26
    0.36     0.26
    0.36     0.27
    0.36     0.26

Mac OS X 10.3.2, gcc (GCC) 3.3 20030304 (Apple Computer, Inc. build 1495), DVI titanium PowerBook G4 1GHz w/ 1GB RAM, ./configure --disable-framework

    before   after
    0.78     0.57
    0.74     0.53
    0.73     0.53
    0.74     0.52
    0.73     0.52
    0.76     0.53
    0.73     0.53
    0.75     0.53
    0.75     0.53
    0.75     0.52

So it seems that the 2GHz G5 is about twice as fast as the 1GHz G4 for Python (when running Tim's test), and that either way you save about 25% with Tim's patch.

-bob

So pretty consistent here, apart from the glitch on the first "after" run, and about an 80% speedup.
Yawn <wink>.
If someone wants to run with this, be my guest. There may be other holes (c.f. PyList_New above), and it could sure use a comment about why it's correct (if it in fact is <0.9 wink>).
I'll look at this this weekend. The improvement is awesome. Hope it doesn't muck up anything else.

Raymond Hettinger

[Raymond Hettinger]
I'll look at this this weekend.
The improvement is awesome. Hope it doesn't muck up anything else.
It probably does. It introduces a deep assumption that a list is *never* allocated, or resized, to hold ob_size slots, but always to hold (at least) roundupsize(ob_size) slots. If that's ever wrong, the "after" code may not call realloc() when a realloc is needed (for example, leave out the part that patched PyList_New(), and watch Python crash all over the place). One of the standard tests crashed, but I didn't have time to figure out why. It seemed odd because it died in some sorting test.

A note about why it "should be" correct: for all i >= 0,

    roundupsize(i) > i                                      [1]

is true. So, calling ob_size "n" for brevity, if we ever reach a point where

    roundupsize(n) == n+1

then since

    roundupsize(n+1) > n+1        (by [1], letting i = n+1)

it follows that

    roundupsize(n+1) > n+1 == roundupsize(n)

Or, IOW, when roundupsize(n) == n+1, we're exactly at a boundary where roundupsize(n+1) grows, so that's a point at which adding a new element requires calling realloc. Otherwise roundupsize(n) > n+1 (since roundupsize(n) > n by [1], it's not possible that roundupsize(n) < n+1), so there's room enough already for an additional element, and realloc() doesn't need to be called.
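(For concreteness, a small Python model of the rounding rule and the boundary test -- a sketch for illustration, not the patched C; treat the roundupsize() details here as approximate, the original in listobject.c is authoritative:)

    def roundupsize(n):
        # round n up to the next allocation boundary (multiples of 8,
        # then 64, then 512, ... as n grows)
        nbits = 0
        n2 = n >> 5
        while 1:
            n2 >>= 3
            nbits += 3
            if not n2:
                break
        return ((n >> nbits) + 1) << nbits

    def append_needs_realloc(n):
        # by the argument above, realloc is needed exactly when the
        # current size n sits on a boundary: roundupsize(n) == n + 1
        # means the rounded-up allocation has no spare slot left
        return roundupsize(n) == n + 1

    for n in range(20):
        if append_needs_realloc(n):
            print "growing from", n, "to", n + 1, "needs realloc"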

Guido van Rossum <guido@python.org> writes:
A collections *package* is a fine idea. I say "package" instead of "module" because there are literally hundreds of container types, and queues and bags are a tiny corner of that space. An example dear to my heart is BTrees (I maintain Zope's BTree implementation), which is a wonderful way to get an associative mapping sorted by key value. BTrees have a large API all by themselves, so it wouldn't make sense to try to squash them into a module along with 50 other container types.
Right. +1 for a package.
I've added this idea to PEP 320 to try to stop it disappearing into the murk... Cheers, mwh -- In short, just business as usual in the wacky world of floating point <wink>. -- Tim Peters, comp.lang.python

It needn't be especially fancy. Here is what I had in mind:
    n = 50

    class fifo(object):
        def __init__(self):
            # block[0:n] is for data and block[n] is a link field
            self.head = self.tail = [None] * (n+1)
            self.headptr = self.tailptr = 0
        ...etc...
I've got a few of the standard FIFO implementations sitting here, and I thought I would give them a run against each other. Testing methods are as follows: initially create a FIFO of n elements; alternate between pushing and popping 2n elements from the FIFO; empty the FIFO. Timing each block, we get the following:

                      Create   Steady   Empty
    n = 10000:
    Fifo              0.047    0.328    0.141
    Fifo_list         0.171    0.485    0.14
    fast_list_fifo    0.0      0.297    0.157
    doublelistfifo    0.0      0.359    0.172
    fifo_raymond      0.109    0.422    0.172
    n = 20000:
    Fifo              0.093    0.657    0.297
    Fifo_list         0.375    0.984    0.266
    fast_list_fifo    0.0      0.594    0.312
    doublelistfifo    0.0      0.719    0.344
    fifo_raymond      0.203    0.859    0.329
    n = 30000:
    Fifo              0.157    0.984    0.437
    Fifo_list         0.563    1.453    0.422
    fast_list_fifo    0.0      0.89     0.469
    doublelistfifo    0.016    1.078    0.515
    fifo_raymond      0.297    1.297    0.5
    n = 40000:
    Fifo              0.203    1.328    0.594
    Fifo_list         0.984    1.953    0.594
    fast_list_fifo    0.015    1.188    0.625
    doublelistfifo    0.0      1.437    0.687
    fifo_raymond      0.438    1.734    0.656

Fifo: dictionary used as a fifo
fifo_list: [cur, [next, [...]]]
fast_list_fifo: push appends, pop increments a pointer but never removes elements
doublelistfifo: self explanatory
fifo_raymond: the fifo described by Raymond

Initialization for Raymond's fifo can be ignored; it is merely inserting objects one at a time (I am short on time this evening). Interpret the rest of the numbers as is reasonable, but consider that there is likely a few percent margin of error on each number, unless something always beats another. I must be going to bed now; the wife is yelling at me.

- Josiah
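(For readers wanting to try the block-linked design, here is one plausible way to fill in the methods the quoted sketch elides with "...etc..." -- the push/pop names and details below are guesses for illustration, not Raymond's actual code:)

    n = 50

    class fifo(object):
        def __init__(self):
            # block[0:n] is for data and block[n] is a link field
            self.head = self.tail = [None] * (n+1)
            self.headptr = self.tailptr = 0

        def push(self, item):
            if self.tailptr == n:          # tail block full: link a new one
                block = [None] * (n+1)
                self.tail[n] = block
                self.tail = block
                self.tailptr = 0
            self.tail[self.tailptr] = item
            self.tailptr += 1

        def pop(self):
            if self.head is self.tail and self.headptr == self.tailptr:
                raise IndexError("pop from empty fifo")
            if self.headptr == n:          # head block drained: follow link
                self.head = self.head[n]
                self.headptr = 0
            item = self.head[self.headptr]
            self.head[self.headptr] = None # drop the reference
            self.headptr += 1
            return item

Pushes and pops each touch only one block, so no element is ever moved after it is stored; drained head blocks are simply dropped for the refcounter to reclaim.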

"Tim Peters" <tim.one@comcast.net> writes:
[Kurt B. Kaiser]
If the overall slowdown of lists to get fast queues was 5%, that would no doubt be unacceptable. If it was 0.5%, I imagine it would be acceptable?
Probably not -- it's too much the tail wagging the dog, as Python lists, used for what they're best at, are core language workhorses already.
Certainly true.
Timing changes under 10% are very hard to establish convincingly, though, especially because the relative performance of Python features varies so widely across platforms (some mix of the code the platform C compiler generates, the quality of the platform C libraries, and pure accidents, like whether the top of the main eval loop happens to end up fighting severe I-cache conflicts with other high-frequency code).
If 0.5% is too much, and if the impact can't be measured reliably at 20x that level, it's Catch-22.
Note that if no pop(0)'s are ever done on the list there is no wrap around and the slowdown is limited to the access of blocksize plus a test; no modulo arithmetic is necessary. This is also the case if the list doesn't grow and items are only removed, including list[0]. And in the latter case, the memmoves associated with internal deletions will dominate the performance in all implementations.
Neither case holds for a queue;
I was trying to make the point that in a non-queue situation the impact would be minimized.
the question that started this was whether to provide a fancy queue implementation in C, and (of course) Guido would rather see Python lists efficiently usable for that purpose too. I suspected that might not be a realistic hope, and now I'm starting to *think* it too <wink>.
Well, the speculative gimmick appears to put almost the whole load on the queues and would surely get a significant improvement over Queue without (apparently) unduly affecting other list operations. Interesting discussion! If I need a queue extension I know where to start. -- KBK

[Tim]
... We use mildly exponential overallocation when growing a list to ensure amortized O(1) behavior regardless.
[Kurt B. Kaiser]
The whole game lies in what happens when you resize: how much data are you going to copy?
We currently don't copy anything when we resize, although the platform realloc() may. To "make room at the left end" we would have to do a memmove() on a contiguous vector of len(list) pointers. I don't agree that's "the whole game", though -- it's a small piece of the game (the point of using exponential overallocation is to ensure that). As I said in the first msg, though, the overallocation gimmick we use now is less effective if a list can grow and shrink at both ends.
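(The amortized claim is easy to see with a toy count; this sketch assumes a pure doubling policy for simplicity, rather than CPython's milder formula:)

    moves = 0
    capacity = 1
    n = 1000000
    for size in xrange(1, n + 1):
        if size > capacity:
            capacity *= 2         # exponential overallocation
            moves += size - 1     # pretend realloc copied the old guts
    print "pointer copies per append:", moves / float(n)  # a small constant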
... If I understand the gimmick, as the queue is used
If a list is used as a queue, yes.
the ob_item/headpointer will walk down the block and items will be appended at the tail until the tail collides with the end of the block. At that point the gimmick approach will memmove the queue to place ob_item at the beginning of the block.
Possibly -- nobody has given enough detail to say. I'd be inclined to continue doing what we do now when an append runs out of room, which is to ask for more room, overallocating by a small percentage of the current list size. If there's free room "at the left end", though, it may or may not be more efficient to memmove the guts left. Those are important details that haven't gotten beyond the hand-wavy stage.
Since the block is only a little longer than the queue, that's quite a bit of copying, unless I'm missing something.
The amount of copying in *any* case where the speculative gimmick copies is proportional to the number of elements currently in the list. When a list contains a million items, it's expensive. If a list contains a few dozen items, it's cheap. The current scheme overallocates proportional to the number of elements in the list, so "a little longer" is "a little" in a relative sense rather than in an absolute sense (e.g., it's millions of elements longer if the list contains hundreds of millions of elements -- then you can go through millions of appends after a resize before needing to do anything fancy again).
With a circular buffer approach, the headpointer and tailpointer will walk down the block and when the end is reached the tailpointer will wrap around and "appended" items will be stored at the low end of the block. The process continues without any copying.
So long as it doesn't run out of room. Eventually it will, and then it has to move stuff too. At that level the schemes are the same. Given specific common queue access patterns, a queue in steady state will "run out of room" more often under the contiguous gimmick than the circular one (I gave a simple example of that before, which in fact is probably a best case for the circular implementation and a worst case for the other one).
The position of the pointers is calculated modulo the blocksize, so no test/branch is involved.
Integer mod is much worse than test/branch -- Python runs on boxes where an integer divide takes more than 100 cycles. So I was assuming the cheaper way to implement this ("if (physical_pointer >= fence) physical_pointer -= block_size;"; test-and-conditionally-add/subtract instead of integer mod; and don't suggest that allocated sizes be restricted to powers of 2 just to make integer mod efficient (we can't afford that much memory waste)).
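(In Python, purely to illustrate the shape of the test-and-subtract Tim means, with index arithmetic standing in for the C pointer arithmetic:)

    def advance(index, block_size):
        # step one slot forward in a circular buffer without the integer
        # mod: a compare and conditional subtract is far cheaper than a
        # divide on most hardware
        index += 1
        if index >= block_size:
            index -= block_size
        return index

    i = 0
    for _ in xrange(10):
        i = advance(i, 4)    # cycles 1, 2, 3, 0, 1, ...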
If there is some space at the beginning of the block which has been freed up by pop(0), the list will automatically wrap around to fill it, even when the pop(0) has been done in a non-queue context.
You can trust that my objection isn't that I don't understand how to implement a circular queue <wink>.
With a circular buffer, upon resize the gap between tail and head is widened by the amount added to the block. With a queue, that would on average amount to copying half the data, but with a known stride.
I don't know what "known stride" means in this context. Regardless of whether a circular implementation is used, a Python list is a contiguous vector of C pointers, and they're all the same size. Only these pointers are moved. With a reasonably careful circular implementation, needing to move half the pointers would be the worst case; assuming that the head and tail are equally likely to bump into each other at each possible position, the average number of pointer moves would be len/4.
One could take a conservative approach and rotate the buffer every resize so that ob_item is at the beginning of the block, but doing that seems to be unnecessary.
Also unhelpful -- every line of C code mucking with lists would still have to cater to the possibility that the list is spread across two disjoint memory slices. ...
As above, I don't see much to recommend that over the speculative Python list gimmick: indexing is slower and more complicated,
pointer arithmetic is cheap compared to memory access.
Let's inject some common sense here. Indexed list access is an extremely frequent operation, and cache-friendly sequential indexed list access is also extremely frequent. Moving memory around is a rare operation, and neither scheme can avoid it: the overwhelmingly most common kind of growing list will continue to be append-at-the-end, where the circular scheme buys nothing, but slows down all list operations. Growing lists are in turn rare compared to fixed-size lists, where the circular implementation is pure loss, in both time and space. The speculative gimmick is a loss in space there too, but doesn't slow it.
resizing is more complicated,
I'm not sure that's the case.
Neither am I, really <wink>.
And the gimmick approach has a 'complication' when the tail hits the end of the block even when the queue length is constant.
That appears to be its only drawback.
and all the listobject.c internals would get more complicated by needing to deal with that the list contents are no longer necessarily in a contiguous slice of a C vector (from the C POV, the list contents could be in two disjoint contiguous slices).
Not if modulo pointer arithmetic is used.
Sorry, but I consider this one too unrealistic to continue talking about. Start here:

    #define PyList_GET_ITEM(op, i) (((PyListObject *)(op))->ob_item[i])
    #define PyList_SET_ITEM(op, i, v) \
            (((PyListObject *)(op))->ob_item[i] = (v))

What are you going to replace those with? They must be fast, and they're expanded in over 100 places in the core. Then move on to here (the simplest function in listobject.c that actually does something <wink>):

    /* Reverse a slice of a list in place, from lo up to (exclusive) hi. */
    static void
    reverse_slice(PyObject **lo, PyObject **hi)
    {
        assert(lo && hi);

        --hi;
        while (lo < hi) {
            PyObject *t = *lo;
            *lo = *hi;
            *hi = t;
            ++lo;
            --hi;
        }
    }

How are you going to rewrite that? For that matter, how are you going to change the function *signature* when "a slice of a Python list" can no longer be represented by a low and high pointer? When you're done with that, do this one:

    /* Merge the na elements starting at pa with the nb elements starting
     * at pb in a stable way, in-place.  na and nb must be > 0, and
     * pa + na == pb.  Must also have that *pb < *pa, that pa[na-1]
     * belongs at the end of the merge, and should have na <= nb.
     * See listsort.txt for more info.
     * Return 0 if successful, -1 if error.
     */
    static int
    merge_lo(MergeState *ms, PyObject **pa, int na, PyObject **pb, int nb)
    {
        int k;
        PyObject *compare;
        PyObject **dest;
        int result = -1;            /* guilty until proved innocent */
        int min_gallop = ms->min_gallop;

        assert(ms && pa && pb && na > 0 && nb > 0 && pa + na == pb);
        if (MERGE_GETMEM(ms, na) < 0)
            return -1;
        memcpy(ms->a, pa, na * sizeof(PyObject*));
        dest = pa;
        pa = ms->a;

        *dest++ = *pb++;
        --nb;
        if (nb == 0)
            goto Succeed;
        if (na == 1)
            goto CopyB;

        compare = ms->compare;
        for (;;) {
            int acount = 0;         /* # of times A won in a row */
            int bcount = 0;         /* # of times B won in a row */

            /* Do the straightforward thing until (if ever) one run
             * appears to win consistently.
             */
            for (;;) {
                assert(na > 1 && nb > 0);
                k = ISLT(*pb, *pa, compare);
                if (k) {
                    if (k < 0)
                        goto Fail;
                    *dest++ = *pb++;
                    ++bcount;
                    acount = 0;
                    --nb;
                    if (nb == 0)
                        goto Succeed;
                    if (bcount >= min_gallop)
                        break;
                }
                else {
                    *dest++ = *pa++;
                    ++acount;
                    bcount = 0;
                    --na;
                    if (na == 1)
                        goto CopyB;
                    if (acount >= min_gallop)
                        break;
                }
            }

            /* One run is winning so consistently that galloping may
             * be a huge win.  So try that, and continue galloping until
             * (if ever) neither run appears to be winning consistently
             * anymore.
             */
            ++min_gallop;
            do {
                assert(na > 1 && nb > 0);
                min_gallop -= min_gallop > 1;
                ms->min_gallop = min_gallop;
                k = gallop_right(*pb, pa, na, 0, compare);
                acount = k;
                if (k) {
                    if (k < 0)
                        goto Fail;
                    memcpy(dest, pa, k * sizeof(PyObject *));
                    dest += k;
                    pa += k;
                    na -= k;
                    if (na == 1)
                        goto CopyB;
                    /* na==0 is impossible now if the comparison
                     * function is consistent, but we can't assume
                     * that it is.
                     */
                    if (na == 0)
                        goto Succeed;
                }
                *dest++ = *pb++;
                --nb;
                if (nb == 0)
                    goto Succeed;

                k = gallop_left(*pa, pb, nb, 0, compare);
                bcount = k;
                if (k) {
                    if (k < 0)
                        goto Fail;
                    memmove(dest, pb, k * sizeof(PyObject *));
                    dest += k;
                    pb += k;
                    nb -= k;
                    if (nb == 0)
                        goto Succeed;
                }
                *dest++ = *pa++;
                --na;
                if (na == 1)
                    goto CopyB;
            } while (acount >= MIN_GALLOP || bcount >= MIN_GALLOP);
            ++min_gallop;           /* penalize it for leaving galloping mode */
            ms->min_gallop = min_gallop;
        }
    Succeed:
        result = 0;
    Fail:
        if (na)
            memcpy(dest, pa, na * sizeof(PyObject*));
        return result;
    CopyB:
        assert(na == 1 && nb > 0);
        /* The last element of pa belongs at the end of the merge. */
        memmove(dest, pb, nb * sizeof(PyObject *));
        dest[nb] = *pa;
        return 0;
    }

There are thousands of additional lines that assume a list is one contiguous vector of pointers, and that the dead-simple and killer-fast ordinary contiguous-vector C idioms work.
It would have the advantage that resizing is never needed until you're wholly out of room, but that seems minor compared to the drawbacks.
Just thought I'd repeat that <wink>.

Possibly -- nobody has given enough detail to say. I'd be inclined to continue doing what we do now when an append runs out of room, which is to ask for more room, overallocating by a small percentage of the current list size. If there's free room "at the left end", though, it may or may not be more efficient to memmove the guts left. Those are important details that haven't gotten beyond the hand-wavy stage.

On enhancing list object: queue vs. deque.

My general impression is that queues, while less common than stacks, are used enough in enough different algorithms that adding one implementation field to list headers so that pop(0) is an O(1) pointer move might be an overall win for Python. The current O(n) move-everything-every-time behavior is certainly a nuisance when queues are needed.

Deques, on the other hand, seem to be more a theoretical construct. Algorithm books include them for completeness in describing possible list disciplines, but generally ignore them thereafter. I am hard put to remember an algorithm that actually needs them. They are certainly much rarer than queues. So I do not think trying to make prepends O(1) should be considered further.
My possibly naive thought is that before reallocating, look at the ratio of left free to total. If memmove would result in as much, or perhaps nearly as much, over-allocation as realloc, then just do that. Certainly if the ratio is small (like, say, .1), don't bother with memmove. The exact cutoff is a tuning matter.

Separate issue: for best performance, one would like to be able to trade space for time and preallocate a large enough block so that most append faults result in a memmove back to the left. Does the current builtin shrinkage discipline allow this? Or does "q=[0]*1000; q[:]=[]" shrink the space for q back to nearly nothing? (One might also want to do the same for stacks. And yes, I remember: 'user tuning' is a 'bad idea'. But....)

Terry J. Reedy

[Terry Reedy]
On enhancing list object: queue vs. deque.
My general impression is that queues, while less common than stacks, are used enough in enough different algorithms that adding one implementation field to list headers so that pop(0) is an O(1) pointer move might be an overall win for Python. The current O(n) move everything every time behavior is certainly a nuisance when queues are needed.
See the other msgs in this thread today -- I'm not convinced that queue-like behavior can be supported that efficiently.
Deques, on the other hand, seem to be more a theoretical construct. Algorithm books include them for completeness in describing possible list disciplines, but generally ignore them thereafter. I am hard put to remember an algorithm that actually needs them. They are certainly much rarer than queues. So I do not think trying to make prepends O(1) should be considered further.
I expect that if queue-like behavior *can* be supported efficiently using an always-contiguous vector, efficient deque-like behavior will fall out for free.
My possibly naive thought is that before reallocating, look at the ratio of left free to total.
I don't know what that means (neither the "left free" part nor what exactly "total" is counting). The devil is in the details here.
If memmove would result in as much, or perhaps nearly as much, over-allocation as realloc,
memmove() leaves the total number of unused slots unchanged. realloc can do that too, or increase it by any amount, or decrease it by any amount. I just don't know what you're trying to say ... explain how it works, step by step, in the

    q = [incoming() for i in range(1000)]
    while True:
        q.append(incoming())
        consume(q.pop(0))

example.
then just do that. Certainly if the ratio is small (like say .1), don't bother with memmove. The exact cutoff is a tuning matter.
Separate issue: for best performance, one would like to be able to trade space for time and preallocate a large enough block so that most append faults result in a memmove back to the left. Does the current builtin shrinkage discipline allow this? Or does "q=[0]*1000; q[:]=[]" shrink the space for q back to nearly nothing?
Python *offers* to give 992 slots back; whether the platform realloc() accepts the offer is up to it. In the "details are everything" department, Python *also* creates a temp vector of length 1000, and copies all of q[:] into it before it deletes any of q: emptying a list is a linear-time operation, not constant-time. That's because it's not safe to decref anything in a list while tearing it down (e.g., the list may be visible to a __del__ method, which must not be allowed to see NULL pointers, or reclaimed objects, in the list).
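(A small demonstration of why that matters, relying on CPython's refcounting; the Watcher class is invented for illustration:)

    class Watcher(object):
        def __init__(self, lst):
            self.lst = lst
        def __del__(self):
            # fires while the list is being emptied; it must never see
            # NULL slots, so the clear detaches the contents first
            print "during teardown, q has length", len(self.lst)

    q = []
    for i in range(3):
        q.append(Watcher(q))
    q[:] = []    # each __del__ runs here and sees an already-empty q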
(One might also want to do same for stack. And yes, I remember: 'user tuning' is a 'bad idea'. But....)

"Tim Peters" <tim.one@comcast.net> wrote in message news:LNBBLJKPBEHFEDALKOLCKEOHIEAB.tim.one@comcast.net...
[Terry Reedy]
So I do not think trying to make prepends O(1) should be considered further.
I expect that if queue-like behavior *can* be supported efficiently using an always-contiguous vector, efficient deque-like behavior will fall out for free.
Efficient inserts at 0 would seem to me to require space at the front which is usually wasted. But if you can avoid this, perhaps by only adding front space when prepends are attempted, great.
My possibly naive thought is that before reallocating, look at the ratio of left free to total.
I don't know what that means (neither the "left free" part nor what exactly "total" is counting). The devil is in the details here.
Whoops, I mixed Python and C viewpoints. By total, I meant the C size of the pointer array. By 'free', I meant the popped pointers 'below' the current 0 element, which from a C view, are not free but merely overwriteable.
If memmove would result in as much, or perhaps nearly as much, over-allocation as realloc,
memmove() leaves the total number of unused slots unchanged. realloc can do that too, or increase it by any amount, or decrease it by any amount. I just don't know what you're trying to say ... explain how it works, step by step, in the
q = [incoming() for i in range(1000)]
Let's assume that the C array is 1024 at this point.
while True: q.append(incoming())
At the 25th append, the C array is full and there are 24 passé pointers. 24/1024 is small, so ignore them and expand as currently. Assume the array is doubled to 2048 (since I don't know the current actual value) and no memmove.
consume(q.pop(0))
After an additional 1024 appends, the array appears full again. However, there are 1048 unneeded pointers to the 'left' of the 0 position and 1048/2048 > .5 >= threshold, so memmove the top 1000 back to the beginning of the array, which stays at size 2048. Every 1048 appends thereafter, repeat the memmove in an array that never again gets resized. In the long run, the average pointer moves per push/pop is less than 1.
then just do that. Certainly if the ratio is small (like say .1), don't bother with memmove. The exact cutoff is a tuning matter.
If expansion is currently by fraction f, then after expansion, f/(1+f) of the slots are free at the end of the array. Let threshold t be this fraction, or perhaps a bit less. With the proposed l[0] popping, suppose that at an append fault, p of s slots are below the l[0] pointer and that p/s >= t. Then do a memmove instead of an expansion. Rationale: after the memmove, as large (or almost as large) a fraction of slots is available at the end of the list for appends as an expansion would have provided. I hope this is somewhat clearer.

Terry J. Reedy
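(Terry's policy is easy to simulate; a rough sketch, with the class name and the doubling growth factor as assumptions for illustration:)

    class SimList(object):
        # models a list with a movable head: pop(0) just advances `head`,
        # and an append that hits the end of the block either memmoves
        # the live data back to slot 0 or grows the block
        GROWTH = 2.0        # assumed expansion factor (f = 1)
        THRESHOLD = 0.5     # t ~= f/(1+f)

        def __init__(self, nitems, capacity):
            self.head = 0           # physical index of logical element 0
            self.size = nitems      # live element count
            self.capacity = capacity
            self.moves = 0          # pointers moved by memmove
            self.reallocs = 0

        def pop0(self):
            self.head += 1          # O(1): no data movement
            self.size -= 1

        def append(self):
            if self.head + self.size == self.capacity:     # append fault
                if float(self.head) / self.capacity >= self.THRESHOLD:
                    self.moves += self.size   # memmove back to slot 0
                    self.head = 0
                else:
                    self.reallocs += 1        # grow the block instead
                    self.capacity = int(self.capacity * self.GROWTH)
            self.size += 1

    q = SimList(1000, 1024)
    for i in xrange(100000):
        q.append()
        q.pop0()
    print "reallocs:", q.reallocs
    print "average pointers moved per push/pop:", q.moves / 100000.0

With the numbers above, this reports a single realloc and an average just under one pointer move per push/pop pair, matching Terry's estimate.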

Tim Peters wrote:
The overallocation strategy gets more complicated, though, and less effective for prepending. If you run out of room when prepending, it's not enough just to ask realloc() for more memory, you also have to memmove() the entire vector, to "make room at the left end". In previous lives I've found that "appending at the right" is as efficient as it is in practice largely because most non-trivial C realloc() calls end up extending the vector in-place, not needing to copy anything. The O() behavior is the same if each boost in size has to move everything that already existed, but the constant factor goes up more than just measurably <wink>.
Shouldn't you malloc() in the case of prepending, and do the copying yourself? realloc can only avoid copying if it can append a free block. For the list, we know

a) realloc is unlikely to find a new block, anyway, as it is pymalloc, which never extends a small block.

b) even if realloc would be able to extend the block, we still would have to copy the data.

I don't know whether the overallocation would make similar-sized room at both ends, or whether it would take the current operation into account. For the specific case of queues, we may find that doing *just* the copying might be sufficient, with no need to go to the allocator.

Regards,
Martin

Shouldn't you malloc() in the case of prepending, and do the copying yourself? realloc can only avoid copying if it can append a free block. For the list, we know a) realloc is unlikely to find a new block, anyway, as it is pymalloc, which never extends a small block.
Not for large lists. pymalloc passes large objects off to real malloc.
b) even if realloc would be able to extend the block, we still would have to copy the data.
For really large lists, you'd risk running out of memory due to the malloc(1GB), while realloc(1GB) very likely just extends (since such a large block effectively becomes in a memory segment of its own).
I don't know whether the overallocation would make similar-sized room at both ends, or whether it would take the current operation into account. For the specific case of queues, we may find that doing *just* the copying might be sufficient, with no need to go to the allocator.
Hm. An idea: don't bother overallocating when inserting in the front, but do keep the free space in the front if possible. Then recommend that queues are implemented using append(x) and pop(0), rather than using insert(0, x) and pop().

--Guido van Rossum (home page: http://www.python.org/~guido/)
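(In user code the recommended discipline is then just this -- a trivial sketch:)

    queue = []
    for task in range(5):
        queue.append(task)     # enqueue at the right
    while queue:
        task = queue.pop(0)    # dequeue from the left: O(1) under the
                               # proposed scheme, an O(n) memmove today
        print "processing", task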

I don't know whether the overallocation would make similar-sized room at both ends, or whether it would take the current operation into account. For the specific case of queues, we may find that doing *just* the copying might be sufficient, with no need to go to the allocator.
Hm. An idea: don't bother overallocating when inserting in the front, but do keep the free space in the front if possible. Then recommend that queus are implemented using append(x) and pop(0), rather than using insert(0, x) and pop().
That is a reasonable approach and probably not difficult to implement. Since external code (outside listobject.c) may already be relying on ob_item, the approach should just change that field to always point to the start of valid data. A new field would have to keep a pointer to the actual allocated block.

Raymond Hettinger

[Martin v. Loewis]
Shouldn't you malloc() in the case of prepending, and do the copying yourself? realloc can only avoid copying if it can append a free block. For the list, we know
a) realloc is unlikely to find a new block, anyway, as it is pymalloc, which never extends a small block.
[Guido]
Not for large lists. pymalloc passes large objects off to real malloc.
pymalloc is irrelevant here: it's not used for list guts, just for list headers. This is an efficiency hack, to speed the append-at-the-end use case: as Martin says, pymalloc cannot extend in-place, so not using pymalloc saves the cycles pymalloc would have consumed copying small list guts repeatedly while the list was growing, and, after the list got large, re-deducing over and over that it has to redirect to the system realloc. This expense is exacerbated because Python doesn't keep track of how much it's overallocated; it *always* calls realloc() on an append (but "almost always" calls realloc() with the same size it called realloc() with the last time).
b) even if realloc would be able to extend the block, we still would have to copy the data.
For really large lists, you'd risk running out of memory due to the malloc(1GB), while realloc(1GB) very likely just extends (since such a large block effectively becomes in a memory segment of its own).
Right, that's the idea.
I don't know whether the overallocation would make similar-sized room at both ends, or whether it would take the current operation into account. For the specific case of queues, we may find that doing *just* the copying might be sufficient, with no need to go to the allocator.
For example, in this steady-state queue:

    q = [x]
    while True:
        q.append(x)
        q.pop(0)

q is never empty then, but never contains more than 1 element.
Hm. An idea: don't bother overallocating when inserting in the front, but do keep the free space in the front if possible.
I'm not clear on what that means, and the devil's in the details.
Then recommend that queues are implemented using append(x) and pop(0), rather than using insert(0, x) and pop().
Given the way realloc() works, I expect that add-at-the-right will always be favored. If someone does "insert at the left", though, and we're out of room, I'd give "the extra space" to the left end (we have to do a memmove in that case regardless, and shifting stuff right N slots doesn't cost more than shifting stuff right 1 slot). There are many possible strategies, and it's worth some pain to make deque-like use of lists reasonably zippy too.

Tim Peters wrote:
Hm. An idea: don't bother overallocating when inserting in the front, but do keep the free space in the front if possible.
I'm not clear on what that means, and the devil's in the details.
I interpret it as: when doing an insert in the middle or at the beginning, continue doing what we do today. I.e. never overallocate extra space in the front, unless the extra space is already there (so realloc would just copy it/leave it as-is). When appending, you would then first check whether there is space at the beginning, and move items if there is. Otherwise reallocate, with the overallocation-to-the-right formula.
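(A toy model of that reading, with invented names, just to pin down the order of the checks:)

    class FrontGapList(object):
        # models Martin's reading: pop(0) advances a head index, and an
        # append that finds the block full first reclaims the front gap
        # with a "memmove" before falling back to reallocation
        def __init__(self, capacity=8):
            self.block = [None] * capacity
            self.head = 0        # physical index of logical element 0
            self.size = 0

        def append(self, item):
            if self.head + self.size == len(self.block):
                if self.head:
                    # free room at the front: slide contents back to 0
                    for i in range(self.size):
                        self.block[i] = self.block[self.head + i]
                    for i in range(self.size, self.head + self.size):
                        self.block[i] = None    # clear the vacated tail
                    self.head = 0
                else:
                    # genuinely full: grow at the right as usual
                    self.block.extend([None] * (len(self.block) >> 3 or 1))
            self.block[self.head + self.size] = item
            self.size += 1

        def pop0(self):
            item = self.block[self.head]
            self.block[self.head] = None        # drop the reference
            self.head += 1
            self.size -= 1
            return item

    q = FrontGapList()
    for i in range(8):
        q.append(i)          # fills the block exactly
    q.pop0(); q.pop0()       # leaves a two-slot gap at the front
    q.append(8)              # slides contents back instead of growing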
Then recommend that queues are implemented using append(x) and pop(0), rather than using insert(0, x) and pop().
Given the way realloc() works, I expect that add-at-the-right will always be favored.
Indeed, I would always implement queues as .append/.pop(0) - it would not have occurred to me to write them as insert(0, x)/.pop(), as I consider .insert on a list unintuitive.

Regards,
Martin
participants (13)

- Aahz
- Bob Ippolito
- Guido van Rossum
- Jeff Epler
- John Williams
- Josiah Carlson
- kbk@shore.net
- Martin v. Loewis
- Michael Hudson
- Raymond Hettinger
- Terry Reedy
- Tim Peters
- Zack Weinberg