Hi Juan,
The transposition is almost certainly because in the Cython implementation the p loop, which runs over rows of the moments array, uses the dc multiplier, which comes from the image columns, while the q loop, which runs over columns of the moments array, uses the dr multiplier, which comes from the image rows. Your other code pairs each moments axis with the corresponding image axis. It is not obvious to me how to extend the current behavior to higher dimensions.
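To see the effect concretely, here is a toy 2D sketch (the function name and keyword are mine, not the actual Cython code) showing that pairing p with dc and q with dr produces exactly the transpose of the axis-matched convention:

```python
import numpy as np

def moments_2d(image, center, order=3, cython_style=False):
    # dr holds the deltas along image rows, dc along image columns.
    dr = np.arange(image.shape[0], dtype=float) - center[0]
    dc = np.arange(image.shape[1], dtype=float) - center[1]
    m = np.empty((order + 1, order + 1))
    for p in range(order + 1):
        for q in range(order + 1):
            if cython_style:
                # p (moments rows) paired with dc (image columns) and
                # q (moments columns) paired with dr (image rows).
                m[p, q] = (image * dc[None, :] ** p * dr[:, None] ** q).sum()
            else:
                # Each moments axis paired with the matching image axis.
                m[p, q] = (image * dr[:, None] ** p * dc[None, :] ** q).sum()
    return m

rng = np.random.default_rng(0)
img = rng.random((5, 7))
a = moments_2d(img, (2.0, 3.0))
b = moments_2d(img, (2.0, 3.0), cython_style=True)
assert np.allclose(a, b.T)  # the two conventions differ by a transpose
```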
As for your implementation, using np.nonzero is probably not going to pay off unless your image is very sparse, because you go from handling a single value per image pixel to carrying an extra coordinate value per dimension on top of it. And you don't really need to compute an offset for each pixel, since many pixels share the same offsets along some dimensions; e.g. for a 100 x 100 x 100 3D image, this code is 2-3x faster:
import itertools

import numpy as np

def moments_central(image, center=None, order=3):
    image = np.asarray(image)
    if center is None:
        center = np.array(image.shape, dtype=float) / 2
    center = np.asarray(center)
    assert center.ndim == 1 and center.size == image.ndim
    # deltas is a list of arrays holding the deltas from the center along each
    # dimension of the image.
    deltas = [np.arange(image.shape[dim], dtype=float) - center[dim]
              for dim in range(image.ndim)]
    # Reshape each delta so that it can properly broadcast against image.
    deltas = [delta.reshape((1,) * dim + (-1,) + (1,) * (image.ndim - dim - 1))
              for dim, delta in enumerate(deltas)]
    moments = np.empty((order + 1,) * image.ndim, dtype=float)
    reusable_buffer = np.empty_like(image, dtype=float)
    for powers in itertools.product(range(order + 1), repeat=image.ndim):
        reusable_buffer[:] = image
        for power, delta in zip(powers, deltas):
            reusable_buffer *= delta ** power
        moments[tuple(powers)] = reusable_buffer.sum()
    return moments
Notice that this just iterates over the moments entries. Broadcasting in this type of problem is dangerous, because you basically replace an N-item loop with creating arrays N times larger than your original data. And Python loops are slow, but not THAT slow. You can actually make the above idea run almost 2x faster by reducing the dimensionality of the intermediate result as soon as possible:
import itertools

import numpy as np

def moments_central_faster(image, center=None, order=3):
    image = np.asarray(image)
    if center is None:
        center = np.array(image.shape, dtype=float) / 2
    center = np.asarray(center)
    assert center.ndim == 1 and center.size == image.ndim
    # deltas is a list of arrays holding the deltas from the center along each
    # dimension of the image.
    deltas = [np.arange(image.shape[dim], dtype=float) - center[dim]
              for dim in range(image.ndim)]
    # Reshape each delta so that it can properly broadcast against image.
    deltas = [delta.reshape((1,) * dim + (-1,) + (1,) * (image.ndim - dim - 1))
              for dim, delta in enumerate(deltas)]
    moments = np.empty((order + 1,) * image.ndim, dtype=float)
    for powers in itertools.product(range(order + 1), repeat=image.ndim):
        # Work on a float copy so the in-place multiply is safe for int images.
        calc = image.astype(float)
        for dim, (power, delta) in enumerate(zip(powers, deltas)):
            calc *= delta ** power
            calc = np.add.reduce(calc, axis=dim, keepdims=True)
        moments[tuple(powers)] = calc.sum()
    return moments
This idea of reducing dimensionality can be given a twist in the form of stacked matrix multiplication: one can think of the whole calculation as reducing each dimension of size D to O, where O is the order plus one, by multiplying by a D x O matrix holding the powers of the deltas along that dimension. In code it could look something like this:
import numpy as np

def moments_central_even_faster(image, center=None, order=3):
    image = np.asarray(image)
    if center is None:
        center = np.array(image.shape, dtype=float) / 2
    center = np.asarray(center)
    assert center.ndim == 1 and center.size == image.ndim
    # deltas is a list of arrays holding the deltas from the center along each
    # dimension of the image.
    deltas = [np.arange(image.shape[dim], dtype=float) - center[dim]
              for dim in range(image.ndim)]
    calc = image
    for dim, delta in enumerate(deltas):
        # D x O matrix holding the powers of the deltas along this dimension.
        powers_of_delta = delta[:, None] ** np.arange(order + 1)
        calc = np.rollaxis(calc, dim, image.ndim)
        calc = np.dot(calc, powers_of_delta)
        calc = np.rollaxis(calc, -1, dim)
    return calc
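As a sanity check on the matrix-multiplication idea, here is a self-contained 2D comparison against a direct per-pixel sum (the helper names are mine, just for illustration):

```python
import numpy as np

def moments_matmul_2d(image, center, order=3):
    # The same stacked matrix-multiplication idea, written out for 2D.
    calc = np.asarray(image, dtype=float)
    for dim in range(2):
        delta = np.arange(image.shape[dim], dtype=float) - center[dim]
        powers_of_delta = delta[:, None] ** np.arange(order + 1)  # D x O
        calc = np.rollaxis(calc, dim, 2)      # move this axis last
        calc = np.dot(calc, powers_of_delta)  # reduce it from D to O
        calc = np.rollaxis(calc, -1, dim)     # and put it back in place
    return calc

def moments_brute_2d(image, center, order=3):
    # Direct definition: m[p, q] = sum_ij image[i, j] * di**p * dj**q.
    m = np.zeros((order + 1, order + 1))
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            di, dj = i - center[0], j - center[1]
            for p in range(order + 1):
                for q in range(order + 1):
                    m[p, q] += image[i, j] * di ** p * dj ** q
    return m

rng = np.random.default_rng(0)
img = rng.random((6, 7))
assert np.allclose(moments_matmul_2d(img, (2.5, 3.0)),
                   moments_brute_2d(img, (2.5, 3.0)))
```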
This turns out to be ridiculously fast for the 100 x 100 x 100 image case (1.41 s vs 520 ms vs 258 ms vs 10.5 ms), so over 100x (!!!) faster than your implementation.
On a 1000 x 1000 2D image it's fast, but not the fastest (571 ms vs 122 ms vs 92.8 ms vs 113 ms), though the fastest version is probably in the same ballpark as the current Cython implementation.
Let me know how your timings compare.
Jaime