
On Mon, 2013-05-06 at 10:32 -0400, Yaroslav Halchenko wrote:
On Wed, 01 May 2013, Sebastian Berg wrote:
btw -- is there something like pandas' vbench for numpy? i.e. something where it would be possible to track/visualize such performance improvements/hits?
Sorry if it seemed harsh, but I only skimmed the mails and it seemed a bit like an obvious piece was missing... There are no benchmark tests I am aware of. You can try:
a = np.random.random((1000, 1000))
and then time a.sum(1) and a.sum(0). On 1.7 the fast axis (1) is only slightly faster than the sum over the slow axis. On earlier numpy versions you will probably see something like half the speed for the slow axis (I only have ancient or 1.7 numpy at hand right now, so I am reluctant to give exact timings).
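A minimal timing sketch of that comparison (timings are of course machine- and version-dependent):

```python
import numpy as np
import timeit

# A 1000x1000 C-contiguous array: axis=1 sums along the fast
# (contiguous) axis, axis=0 along the slow one.
a = np.random.random((1000, 1000))

t_fast = min(timeit.repeat(lambda: a.sum(1), number=100, repeat=3))
t_slow = min(timeit.repeat(lambda: a.sum(0), number=100, repeat=3))
print("axis=1: %.4fs  axis=0: %.4fs  ratio: %.2f"
      % (t_fast, t_slow, t_slow / t_fast))
```

On numpy >= 1.7 the ratio should be close to 1; on older versions the slow axis can come out around twice as slow.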
FWIW -- just as a crude first attempt, look at
http://www.onerussian.com/tmp/numpy-vbench-20130506/vb_vb_reduce.html
why is the float16 case so special?
Float16 is special: it is cpu-bound -- not memory-bound as most reductions are -- because it is not a native type. At first I thought it was weird, but it actually makes sense. If you have a and b as float16, then a + b is actually more like (I believe):

    float16(float32(a) + float32(b))

This means there is type casting going on *inside* the ufunc! Normally casting is handled outside the ufunc (by the buffered iterator).

Now I did not check, but when the iteration order is not optimized, the ufunc *can* simplify this to something like the following (along the reduction axis):

    result = float32(a[0])
    for x in a[1:]:
        result += float32(x)
    return float16(result)

For "optimized" iteration order this cannot happen, because the intermediate result is always written back. This means that for optimized iteration order a single conversion to float32 is necessary (in the inner loop), while for unoptimized iteration order two conversions to float32 and one back are done. Since this conversion is costly, the memory throughput is actually not important (no gain from buffering). This leads to the visible slowdown.

This is of course a bit annoying, but I am not sure how you would solve it. (Have the dtype signal that it doesn't even want iteration order optimization? Try to move those weird float16 conversions from the ufunc to the iterator somehow?)
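The casting model described above can be checked directly for scalars (a sketch; the exact rounding path is an implementation detail of numpy's half-float support):

```python
import numpy as np

a = np.float16(0.1)
b = np.float16(0.2)

# float16 has no native arithmetic on common CPUs; the ufunc upcasts
# to float32, adds, and rounds the result back down to float16, so
# the direct sum should match the explicit round-trip:
direct = a + b
roundtrip = np.float16(np.float32(a) + np.float32(b))
print(direct, roundtrip, direct == roundtrip)
```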
I have pushed this really coarse setup (based on some elderly copy of pandas' vbench) to https://github.com/yarikoptic/numpy-vbench
If you care to tune it up/extend it, I could then fire it up again on that box (which doesn't do anything else ATM AFAIK). Since the majority of the time is spent actually building numpy (I did it with ccache though), it would be neat if you came up with more benchmarks to run which you think could be interesting/important.
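More reduction benchmarks could start from something like the following plain-timeit sketch (the repo's vb_reduce suite wraps cases in vbench's Benchmark objects, so these would need translating; the helper name here is made up):

```python
import numpy as np
import timeit

def bench_reduce(dtype, axis, shape=(1000, 1000), number=20):
    """Best-of-3 time in seconds for a.sum(axis) on an array of dtype."""
    a = np.ones(shape, dtype=dtype)
    return min(timeit.repeat(lambda: a.sum(axis), number=number, repeat=3))

# float16 should stand out as much slower than float32/float64,
# since its reduction is cpu-bound (casting), not memory-bound.
for dtype in (np.float16, np.float32, np.float64):
    fast = bench_reduce(dtype, axis=1)
    slow = bench_reduce(dtype, axis=0)
    print("%-8s axis=1: %.4fs  axis=0: %.4fs"
          % (np.dtype(dtype).name, fast, slow))
```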
I think this is pretty cool! It would probably be a while until there are many tests, but if you or someone could set such a thing up, it could slowly grow as larger code changes are done. Regards, Sebastian