Thanks for the feedback.  The results make it clear that we should
somehow tune the number according to the load of the machine --
picking the right number for the load can easily make a 20% speed
difference (at least on Mac OS X, but I strongly suspect the same is
true on other platforms).

Ideally, the GC should dynamically adapt its nursery size in order to
minimize cache misses.  If anyone has a suggestion on how to
implement that, preferably in a non-OS-specific way (e.g. by reading
some x86 CPU counters), I'd welcome it :-)
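
Failing real CPU counters, one OS-independent fallback might be a
plain hill climb on some measured cost.  The sketch below is only an
illustration under assumptions: tune_nursery_size and measure_cost are
hypothetical names, not existing functions, and measure_cost(size)
stands for whatever cost we can actually sample for a given nursery
size (time per allocated byte, or a cache-miss count if one is
available), lower being better.

```python
def tune_nursery_size(measure_cost, start, lo, hi, rounds=16):
    """Hill-climb over nursery sizes by doubling or halving.

    measure_cost(size) is a hypothetical callback returning a cost
    for running with that nursery size (lower is better).
    """
    size = start
    best = measure_cost(size)
    step_up = True  # current search direction: grow or shrink
    for _ in range(rounds):
        candidate = size * 2 if step_up else size // 2
        if candidate < lo or candidate > hi:
            # Hit a bound: turn around and try the other direction.
            step_up = not step_up
            continue
        cost = measure_cost(candidate)
        if cost < best:
            # Improvement: move there and keep going the same way.
            size, best = candidate, cost
        else:
            # No improvement: stay put and reverse direction.
            step_up = not step_up
    return size
```

This assumes the cost curve has a single minimum and that the
measurements are not too noisy; under varying machine load the search
would have to be re-run periodically rather than once.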

