Dec. 6, 2019
5:55 p.m.
However, surprisingly, the function above was ~30% slower for an img of shape 256x256x256 when I tried it. My guess is that your implementation has a more favorable cache/memory access pattern, which improves performance despite the theoretically higher number of FLOPs required.
This is quite a common problem: when using numpy arrays, moving data is often more expensive than the calculations themselves, because the arrays no longer fit into the cache.
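To make the point concrete, here is a minimal sketch (not the function discussed above) that computes the same expression two ways: once letting numpy allocate full-size temporaries, and once reusing a preallocated buffer via the `out=` argument of the ufuncs. The function names `with_temporaries`, `in_place` and `compare` are made up for illustration, and the timing difference, if any, will depend on array size, cache size and machine.

```python
import timeit

import numpy as np


def with_temporaries(a, b, c):
    # a * b allocates one full-size temporary, and "+ c" allocates the result,
    # so two new arrays are written to memory in addition to reading a, b, c.
    return a * b + c


def in_place(a, b, c, out):
    # Same arithmetic, but both steps write into one preallocated buffer,
    # so no new full-size array is allocated per call.
    np.multiply(a, b, out=out)
    np.add(out, c, out=out)
    return out


def compare(shape=(64, 64, 64), repeats=5):
    # Hypothetical helper: returns the best-of-N wall time for each variant.
    rng = np.random.default_rng(0)
    a, b, c = (rng.random(shape) for _ in range(3))
    out = np.empty(shape)
    t_tmp = min(timeit.repeat(lambda: with_temporaries(a, b, c),
                              number=1, repeat=repeats))
    t_inp = min(timeit.repeat(lambda: in_place(a, b, c, out),
                              number=1, repeat=repeats))
    return t_tmp, t_inp
```

Calling `compare(shape=(256, 256, 256))` should give a rough idea of how much of the runtime is memory traffic rather than arithmetic on a given machine; both variants perform the same number of floating-point operations.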