[Numpy-discussion] Really cruel draft of vbench setup for NumPy (.add.reduce benchmarks since 2011)

Sebastian Berg sebastian at sipsolutions.net
Mon May 6 12:03:52 EDT 2013


On Mon, 2013-05-06 at 10:32 -0400, Yaroslav Halchenko wrote:
> On Wed, 01 May 2013, Sebastian Berg wrote:
> > > btw -- is there something like panda's vbench for numpy?  i.e. where
> > > it would be possible to track/visualize such performance
> > > improvements/hits?
> 
> 
> > Sorry if it seemed harsh, but I only skimmed the mails and it seemed
> > a bit like an obvious piece was missing... There are no benchmark
> > tests I am aware of. You can try:
> 
> > a = np.random.random((1000, 1000))
> 
> > and then time a.sum(1) and a.sum(0) on 1.7. The sum over the fast
> > axis (1) is only slightly faster than the sum over the slow axis. On
> > earlier numpy versions you will probably see something like half the
> > speed for the slow axis (I only have an ancient or a 1.7 numpy right
> > now, so I am reluctant to give exact timings).
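
To spell that out, e.g. in an IPython session (just a quick sketch):

import numpy as np
a = np.random.random((1000, 1000))   # C order, so axis 1 is the fast axis
%timeit a.sum(0)                     # reduction over the slow axis
%timeit a.sum(1)                     # reduction over the fast axis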
> 
> FWIW -- just as a crude first attempt, look at
> 
> http://www.onerussian.com/tmp/numpy-vbench-20130506/vb_vb_reduce.html
> 
> why is the float16 case so special?

Float16 is special: it is cpu bound -- not memory bound like most
reductions -- because it is not a native type. At first I thought it
was weird, but it actually makes sense. If you have a and b as
float16, then:

a + b

is actually more like (I believe...):

float16(float32(a) + float32(b))
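
You can check that model directly from Python (a minimal sketch; I
believe the float16 loops compute in float32 internally, but I did not
go back to the C source to verify):

import numpy as np

a = np.float16(1.001)
b = np.float16(0.0004883)
direct = a + b                                    # numpy's own float16 add
emulated = np.float16(np.float32(a) + np.float32(b))
print(direct == emulated)   # should print True if the model holds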

This means there is type casting going on *inside* the ufunc! Normally
casting is handled outside the ufunc (by the buffered iterator). Now I
did not check, but when the iteration order is not optimized, the ufunc
*can* simplify this to something like the following (along the
reduction axis):

result = float32(a[0])
for i in xrange(1, len(a)):
    result += float32(a[i])    # one conversion up per element
return float16(result)         # a single conversion back at the end

While for "optimized" iteration order, this cannot happen because the
intermediate result is always written back.
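
For contrast, a rough sketch of that write-back variant (hypothetical
Python, assuming numpy imported as np and arr a 2-d float16 array being
reduced along axis 0 in memory order -- the real loop is C, of course):

out = arr[0].copy()                  # running result, stored as float16
for i in xrange(1, arr.shape[0]):    # memory-order (outer) iteration
    for j in xrange(arr.shape[1]):
        # two conversions up to float32 and one back down per element:
        out[j] = np.float16(np.float32(out[j]) + np.float32(arr[i, j]))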

This means that for the unoptimized iteration order only a single
conversion to float32 per element is necessary (in the inner loop),
while for the optimized iteration order two conversions to float32 and
one back to float16 are done per element. Since these conversions are
costly, the memory throughput is actually not important (there is no
gain from buffering), and this leads to the visible slowdown. This is
of course a bit annoying, but I am not sure how you would solve it
(have the dtype signal that it doesn't even want iteration order
optimization? Try to move those weird float16 conversions from the
ufunc to the iterator somehow?).
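
If you want to see the effect outside of vbench, something along these
lines should show it (a sketch; actual timings will of course depend on
the machine):

import numpy as np
import timeit

a32 = np.random.random((1000, 1000)).astype(np.float32)
a16 = a32.astype(np.float16)

# float32 reductions are memory bound, so both axes should be close;
# float16 pays for the per-element conversions and is cpu bound
for arr in (a32, a16):
    for axis in (0, 1):
        t = timeit.timeit(lambda: arr.sum(axis), number=100)
        print('%s axis=%d: %.4f s' % (arr.dtype, axis, t))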

> 
> I have pushed this really coarse setup (based on an old copy of
> pandas' vbench) to
> https://github.com/yarikoptic/numpy-vbench
> 
> if you care to tune it up/extend it, I could then fire it up again on
> that box (which doesn't do anything else ATM AFAIK).  Since the
> majority of the time is spent actually building numpy (I did use
> ccache), it would be neat if you came up with more benchmarks to run
> that you think could be interesting/important.
> 

I think this is pretty cool! It would probably be a while until there
are many tests, but if you or someone could set such a thing up, it
could slowly grow as larger code changes are done?

Regards,

Sebastian