[Python-Dev] collections module

Mon Jan 12 16:17:39 EST 2004

[Raymond Hettinger]
> No, I meant that it is *much* faster to do an indexed write to a
> pre-allocated list than to use anything relying on PyList_Append() or
> ins1().  All of those individual calls to realloc are expensive.  The
> 50 to 1 dilution of that expense is how it beats Knuth's two stack
> trick.

That's fine.  As I said before, I've never had a use for a queue in Python
where the two-stack trick (coded in Python, even) wasn't more than fast
enough for me.

If I did, there are a universe of possibilities, including Marc-Andre's
mxQueue.  That doesn't use Python lists; doesn't call realloc on every queue
put (only when it's out of room, which is "never" when a queue reaches
steady state); and, when it does realloc, overallocates by a much larger
amount than Python's list type.

> Guess which of these two statements is faster and by how much:
>
>      list(itertools.repeat(None, 5000))
>      list(tuple(itertools.repeat(None, 5000)))

I did guess, and was correct to 6 significant decimal digits <wink>.

> The answer is important because it points the way to faster list
> comprehensions and other list building applications.

That Python's builtin list calls realloc() on every append is pretty gross,
but it's also a tradeoff, saving the bytes in the list header that would be
required to remember how many allocated-but-unused slots exist.  If you've
got a million tiny lists (and some apps do), burning an additional 8MB for
something your app doesn't really need isn't attractive (on a 32-bit box
today, the list header is 16 bytes; pymalloc aligns to 8-byte boundaries, so
the next possible size is 24 bytes).  list's very small overallocation (in
the ballpark of 5%) is also aimed at minimizing memory burden.  Speed isn't
everything, tradeoffs are hard, and something loses when something else is
favored.  (BTW, I never would have considered implementing a resizable list
type *without* keeping a field to record the amount of overallocation, so
this isn't the tradeoff I would have made <wink>.)

> ...
> And my interest is in a collections module which should probably
> published elsewhere (too much negativity on the subject here).

A collections *package* is a fine idea.  I say "package" instead of "module"
because there are literally hundreds of container types, and queues and bags
are a tiny corner of that space.  An example dear to my heart is BTrees (I
maintain Zope's BTree implementation), which is a wonderful way to get an
associative mapping sorted by key value.  BTrees have a large API all by
themselves, so it wouldn't make sense to try to squash them into a module
along with 50 other container types.

> I am frankly surprised that so many here think that a collections
> module would be bad for the language.

I was more of the impression that bags didn't have fans <wink>.

> Go figure.  It's not like these are new, untried ideas -- the goal
> was supposed be improving productivity by providing higher level
> tools.  A fast underlying implementation was just icing on the cake.

I'm strongly in favor of a collections package.