Increasing the pool size is one obvious way to fix these problems. I think a 16 KiB pool size and a 2 MiB arena size (the x86 huge page size) is a sweet spot for recent web servers (typically about 32 threads and 64 GiB of RAM), but I have no hard evidence for that.
Note that the OS won't hand out huge pages automatically, because memory management becomes much less flexible when it does.
For example, the Linux madvise() man page has this to say about MADV_HUGEPAGE:
This feature is primarily aimed at applications that use large mappings of data and access large regions of that memory at a time (e.g., virtualization systems such as QEMU). It can very easily waste memory (e.g., a 2 MB mapping that only ever accesses 1 byte will result in 2 MB of wired memory instead of one 4 KB page). See the Linux kernel source file Documentation/vm/transhuge.txt for more details.
I'm not sure a small-object allocator falls into the right use case for huge pages.
The SuperMalloc paper I recently pointed at notes that it uses huge pages only for "huge" requests, not for "small", "medium", or "large" requests.
But it carves up 2 MiB chunks, aligned at 2 MiB addresses, for each size class anyway (which are backed by 4K pages).
There's a mix of reasons for that. Partly the same reasons I want bigger pools and arenas: to stay in the fastest code paths. Hitting page/arena/chunk boundaries costs cycles for computation and conditional branches, and clobbers cache lines to access & mutate bookkeeping info that the fast paths don't touch.
Also to reduce the fraction of allocator space "wasted" on bookkeeping info. 48 header bytes out of a 4K pool is a bigger percentage hit (about 1.2%) than, say, two 4K pages (to hold fancier allocator bookkeeping data structures) out of a 2M chunk (about 0.4%).
And partly for the same reason Neil is keen for bigger arenas in his branch: to reduce the size of data structures to keep track of other bookkeeping info (in Neil's case, a radix tree, which can effectively shift away the lowest ARENA_BITS bits of addresses it needs to store).
That hints at much of why it wants "huge" chunks, but doesn't explain why it doesn't want huge pages except to satisfy huge requests. That's because it strives to be able to release physical RAM back to the system on a page basis (which is also part of why it needs fancier bookkeeping data structures to manage its chunks - it needs to keep track of which pages are in use, and apply page-based heuristics to push toward freeing pages).
So that combines very much larger "pools" (2M vs. 4K) with better chances of actually returning no-longer-used pages to the system (on a 4K basis rather than a 256K basis). But it's built on piles of platform-specific code, and isn't suitable at all for 32-bit boxes: it relies on virtual address space being an abundant resource on 64-bit boxes - reserving 2M of address space is close to trivial, and could potentially be done millions of times without getting into trouble.