Hi Chintak,

You have processed the StackOverflow answers impressively well: great blog post!

Just a quick note on the performance of np.einsum. I have found that it performs much better when handed only two operands. So you may want to benchmark whether applying the mask to the template before the call to np.einsum makes your code run faster. I don't think there is a way out of this three-operand call:

ssd += np.einsum('ijkl, ijkl, kl->ij', y, y, valid_mask)

But there is a good chance that:

ssd = np.einsum('ijkl, kl, kl->ij', y, template, valid_mask, dtype=float)

runs noticeably faster as:

ssd = np.einsum('ijkl, kl->ij', y, template*valid_mask, dtype=float)

A quick test on my system, with a 1000x1000 image and a 9x9 template and mask, all floats, shows it is 25% faster. And this is where about half of your processing time is being spent, so that little change would give you a 10% performance boost for free in this particular case. You may want to test a wider variety of parameter sizes to see if the improvement holds.
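
If you want to reproduce the comparison, something along these lines should do. It is only a sketch: the image size, the random data and the use of sliding_window_view (NumPy 1.20 or later) to build y are my assumptions, not necessarily how your code constructs y, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # needs NumPy >= 1.20

rng = np.random.default_rng(0)
image = rng.random((200, 200))        # small stand-in image; try larger sizes too
template = rng.random((9, 9))
valid_mask = (rng.random((9, 9)) > 0.2).astype(float)

# Assumed layout: y[i, j] is the 9x9 window whose top-left corner is (i, j).
y = sliding_window_view(image, template.shape)


def three_operand():
    # Mask applied inside the einsum call (three operands).
    return np.einsum('ijkl,kl,kl->ij', y, template, valid_mask)


def two_operand():
    # Mask applied to the template beforehand (two operands).
    return np.einsum('ijkl,kl->ij', y, template * valid_mask)


# Both forms should produce the same result up to rounding.
assert np.allclose(three_operand(), two_operand())

t3 = min(timeit.repeat(three_operand, number=5, repeat=3))
t2 = min(timeit.repeat(two_operand, number=5, repeat=3))
print(f"three operands: {t3:.4f} s, two operands: {t2:.4f} s")
```

Taking the minimum over a few repeats keeps one-off hiccups from skewing the numbers.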

The third call to np.einsum has a negligible impact on overall performance, but if you store the value of template*valid_mask, it also runs faster as a two-operand call, i.e. as:

ssd += np.einsum('ij, ij', template, cached_template_times_valid_mask)
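
A small sanity check for the cached version, in case it helps. Note that the three-operand form below is my guess at what the third call currently looks like, and the random inputs are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
template = rng.random((9, 9))
valid_mask = (rng.random((9, 9)) > 0.2).astype(float)

# Compute the masked template once; it can be reused by the other einsum call too.
cached_template_times_valid_mask = template * valid_mask

# Assumed three-operand form of the third call (a guess, not taken from the post).
term_three = np.einsum('ij,ij,ij', template, template, valid_mask)

# Two-operand form using the cached product.
term_two = np.einsum('ij,ij', template, cached_template_times_valid_mask)

# Both should give the same scalar up to rounding.
assert np.isclose(term_three, term_two)
```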


( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.