Results: 2.86% faster for 1 arg (len), 11.8% faster for 2 args (min), and 1.6% faster for pybench.
./python.exe -m timeit 'for x in xrange(10000): len([])'
./python.exe -m timeit 'for x in xrange(10000): min(1,2)'
One part of it is a little dangerous though.
The general idea is to preallocate arg tuples and never deallocate them. This saves a fair amount of work. I'm not sure it's entirely safe, though.
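To make the safety concern concrete, here is a rough toy model in Python rather than the patch itself. It assumes one reusable argument buffer per arity, with a list standing in for the C-level arg tuple. ArgCache, smallest and remember are hypothetical names made up for this sketch. It shows one way reuse could go wrong: a callee that holds on to its args sees them change on a later call.

```python
class ArgCache:
    """Toy model (hypothetical): one preallocated argument buffer per arity."""

    def __init__(self, max_args=3):
        # Allocate every buffer once, up front, and never free it.
        self._buffers = {n: [None] * n for n in range(1, max_args + 1)}

    def call(self, func, *values):
        buf = self._buffers[len(values)]
        buf[:] = values  # overwrite in place instead of allocating a new buffer
        return func(buf)


def smallest(args):
    # Well-behaved callee: uses the args and lets go of them.
    return min(args)


retained = []


def remember(args):
    # Problem callee: keeps a reference to the buffer itself, not a copy.
    retained.append(args)
    return len(args)


cache = ArgCache()
first = cache.call(smallest, 1, 2)
cache.call(remember, 1, 2)   # remember() now holds the shared 2-arg buffer
cache.call(smallest, 8, 9)   # a later call silently overwrites that buffer
print(first, retained[0])
```

In this model the args that remember() saved no longer read 1, 2 after the next 2-arg call. Whether the real patch can hit the same situation depends on whether any callee can keep a reference to the preallocated tuple, which I have not verified.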
While doing this patch I noticed that PyTuple_Pack() calls PyTuple_New(), which initializes each item to NULL; PyTuple_Pack() then sets each item to the appropriate value. If we could get rid of duplicate work like that (or of checking values in both callers and callees), we could get more speed. To find the functions where this matters most, you can use Walter's coverage results: