RE: [Python-Dev] collections module

Jan. 12, 2004


      [Raymond Hettinger]
...
No, I meant that it is *much* faster to do an indexed write to a
pre-allocated list than to use anything relying on PyList_Append() or
ins1().  All of those individual calls to realloc are expensive.  The
50 to 1 dilution of that expense is how it beats Knuth's two stack
trick.
That's fine.  As I said before, I've never had a use for a queue in Python
where the two-stack trick (coded in Python, even) wasn't more than fast
enough for me.

If I did, there are a universe of possibilities, including Marc-Andre's
mxQueue.  That doesn't use Python lists; doesn't call realloc on every queue
put (only when it's out of room, which is "never" when a queue reaches
steady state); and, when it does realloc, overallocates by a much larger
amount than Python's list type.
...
Guess which of these two statements is faster and by how much:
list(itertools.repeat(None, 5000))
     list(tuple(itertools.repeat(None, 5000)))
I did guess, and was correct to 6 significant decimal digits <wink>.
...
The answer is important because it points the way to faster list
comprehensions and other list building applications.
That Python's builtin list calls realloc() on every append is pretty gross,
but it's also a tradeoff, saving the bytes in the list header that would be
required to remember how many allocated-but-unused slots exist.  If you've
got a million tiny lists (and some apps do), burning an additional 8MB for
something your app doesn't really need isn't attractive (on a 32-bit box
today, the list header is 16 bytes; pymalloc aligns to 8-byte boundaries, so
the next possible size is 24 bytes).  list's very small overallocation (in
the ballpark of 5%) is also aimed at minimizing memory burden.  Speed isn't
everything, tradeoffs are hard, and something loses when something else is
favored.  (BTW, I never would have considered implementing a resizable list
type *without* keeping a field to record the amount of overallocation, so
this isn't the tradeoff I would have made <wink>.)
...
...
And my interest is in a collections module which should probably
published elsewhere (too much negativity on the subject here).
A collections *package* is a fine idea.  I say "package" instead of "module"
because there are literally hundreds of container types, and queues and bags
are a tiny corner of that space.  An example dear to my heart is BTrees (I
maintain Zope's BTree implementation), which is a wonderful way to get an
associative mapping sorted by key value.  BTrees have a large API all by
themselves, so it wouldn't make sense to try to squash them into a module
along with 50 other container types.
...
I am frankly surprised that so many here think that a collections
module would be bad for the language.
I was more of the impression that bags didn't have fans <wink>.
...
Go figure.  It's not like these are new, untried ideas -- the goal
was supposed be improving productivity by providing higher level
tools.  A fast underlying implementation was just icing on the cake.
I'm strongly in favor of a collections package.

RE: [Python-Dev] collections module

Tim Peters