
On Mon, 2013-05-06 at 10:32 -0400, Yaroslav Halchenko wrote:
On Wed, 01 May 2013, Sebastian Berg wrote:
btw -- is there something like pandas' vbench for numpy? i.e. something where it would be possible to track/visualize such performance improvements/hits?
Sorry if it seemed harsh, but I only skimmed the mails and it seemed a bit like an obvious piece was missing... There are no benchmark tests I am aware of. You can try:
a = np.random.random((1000, 1000))
and then time a.sum(1) and a.sum(0). On 1.7 the fast axis (1) is only slightly faster than the sum over the slow axis. On earlier numpy versions you will probably see something like half the speed for the slow axis (I only have ancient or 1.7 numpy at hand right now, so I am reluctant to give exact timings).
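A minimal timing sketch of that comparison (timings are of course machine- and version-dependent):

```python
import numpy as np
import timeit

# A 1000x1000 C-contiguous array: axis=1 sums along the fast
# (contiguous) axis, axis=0 along the slow one.
a = np.random.random((1000, 1000))

t_fast = min(timeit.repeat(lambda: a.sum(1), number=100, repeat=3))
t_slow = min(timeit.repeat(lambda: a.sum(0), number=100, repeat=3))
print("axis=1: %.4fs  axis=0: %.4fs  ratio: %.2f"
      % (t_fast, t_slow, t_slow / t_fast))
```

On numpy >= 1.7 the ratio should be close to 1; on older versions the slow axis can come out around twice as slow.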
FWIW -- just as a crude first attempt, look at
http://www.onerussian.com/tmp/numpy-vbench-20130506/vb_vb_reduce.html
why is the float16 case so special?
Float16 is special: it is cpu-bound -- not memory-bound as most reductions are -- because it is not a native type. At first I thought it was weird, but it actually makes sense. If you have a and b as float16, then a + b is actually more like (I believe):

    float16(float32(a) + float32(b))

This means there is type casting going on *inside* the ufunc! Normally casting is handled outside the ufunc (by the buffered iterator).

Now I did not check, but when the iteration order is not optimized, the ufunc *can* simplify this to something like the following (along the reduction axis):

    result = float32(a[0])
    for x in a[1:]:
        result += float32(x)
    return float16(result)

For "optimized" iteration order this cannot happen, because the intermediate result is always written back. This means that for optimized iteration order a single conversion to float32 is necessary (in the inner loop), while for unoptimized iteration order two conversions to float32 and one back are done. Since this conversion is costly, the memory throughput is actually not important (no gain from buffering). This leads to the visible slowdown.

This is of course a bit annoying, but I am not sure how you would solve it. (Have the dtype signal that it doesn't even want iteration order optimization? Try to move those weird float16 conversions from the ufunc to the iterator somehow?)
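The casting model described above can be checked directly for scalars (a sketch; the exact rounding path is an implementation detail of numpy's half-float support):

```python
import numpy as np

a = np.float16(0.1)
b = np.float16(0.2)

# float16 has no native arithmetic on common CPUs; the ufunc upcasts
# to float32, adds, and rounds the result back down to float16, so
# the direct sum should match the explicit round-trip:
direct = a + b
roundtrip = np.float16(np.float32(a) + np.float32(b))
print(direct, roundtrip, direct == roundtrip)
```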
I have pushed this really coarse setup (based on some elderly copy of pandas' vbench) to https://github.com/yarikoptic/numpy-vbench
If you care to tune it up/extend it, I could then fire it up again on that box (which doesn't do anything else ATM AFAIK). Since the majority of the time is spent actually building numpy (I did it with ccache though), it would be neat if you came up with more benchmarks to run which you think could be interesting/important.
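More reduction benchmarks could start from something like the following plain-timeit sketch (the repo's vb_reduce suite wraps cases in vbench's Benchmark objects, so these would need translating; the helper name here is made up):

```python
import numpy as np
import timeit

def bench_reduce(dtype, axis, shape=(1000, 1000), number=20):
    """Best-of-3 time in seconds for a.sum(axis) on an array of dtype."""
    a = np.ones(shape, dtype=dtype)
    return min(timeit.repeat(lambda: a.sum(axis), number=number, repeat=3))

# float16 should stand out as much slower than float32/float64,
# since its reduction is cpu-bound (casting), not memory-bound.
for dtype in (np.float16, np.float32, np.float64):
    fast = bench_reduce(dtype, axis=1)
    slow = bench_reduce(dtype, axis=0)
    print("%-8s axis=1: %.4fs  axis=0: %.4fs"
          % (np.dtype(dtype).name, fast, slow))
```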
I think this is pretty cool! It would probably be a while until there are many tests, but if you or someone could set such a thing up, it could slowly grow as larger code changes are done. Regards, Sebastian