[Python-Dev] collections module
Tim Peters
tim.one at comcast.net
Mon Jan 12 16:17:39 EST 2004
[Raymond Hettinger]
> No, I meant that it is *much* faster to do an indexed write to a
> pre-allocated list than to use anything relying on PyList_Append() or
> ins1(). All of those individual calls to realloc are expensive. The
> 50 to 1 dilution of that expense is how it beats Knuth's two stack
> trick.
That's fine. As I said before, I've never had a use for a queue in Python
where the two-stack trick (coded in Python, even) wasn't more than fast
enough for me.
If I did, there are a universe of possibilities, including Marc-Andre's
mxQueue. That doesn't use Python lists; doesn't call realloc on every queue
put (only when it's out of room, which is "never" when a queue reaches
steady state); and, when it does realloc, overallocates by a much larger
amount than Python's list type.
> Guess which of these two statements is faster and by how much:
>
> list(itertools.repeat(None, 5000))
> list(tuple(itertools.repeat(None, 5000)))
I did guess, and was correct to 6 significant decimal digits <wink>.
> The answer is important because it points the way to faster list
> comprehensions and other list building applications.
That Python's builtin list calls realloc() on every append is pretty gross,
but it's also a tradeoff, saving the bytes in the list header that would be
required to remember how many allocated-but-unused slots exist. If you've
got a million tiny lists (and some apps do), burning an additional 8MB for
something your app doesn't really need isn't attractive (on a 32-bit box
today, the list header is 16 bytes; pymalloc aligns to 8-byte boundaries, so
the next possible size is 24 bytes). list's very small overallocation (in
the ballpark of 5%) is also aimed at minimizing memory burden. Speed isn't
everything, tradeoffs are hard, and something loses when something else is
favored. (BTW, I never would have considered implementing a resizable list
type *without* keeping a field to record the amount of overallocation, so
this isn't the tradeoff I would have made <wink>.)
> ...
> And my interest is in a collections module which should probably
> published elsewhere (too much negativity on the subject here).
A collections *package* is a fine idea. I say "package" instead of "module"
because there are literally hundreds of container types, and queues and bags
are a tiny corner of that space. An example dear to my heart is BTrees (I
maintain Zope's BTree implementation), which is a wonderful way to get an
associative mapping sorted by key value. BTrees have a large API all by
themselves, so it wouldn't make sense to try to squash them into a module
along with 50 other container types.
> I am frankly surprised that so many here think that a collections
> module would be bad for the language.
I was more of the impression that bags didn't have fans <wink>.
> Go figure. It's not like these are new, untried ideas -- the goal
> was supposed be improving productivity by providing higher level
> tools. A fast underlying implementation was just icing on the cake.
I'm strongly in favor of a collections package.
More information about the Python-Dev
mailing list