Aah, great to find you on the scikit-image mailing list! I can certainly learn a lot from you. =)
Coming to the point you raised, I was wondering whether we could get rid of `valid_template` altogether in this evaluation and introduce another einsum product of `ssd` with `valid_mask`?
For example, this is what we are doing:
`c*a**2 + c*b**2 - 2*c*a*b`, which is really equivalent to `c*(a-b)**2`.
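Here is a quick sketch to check that identity numerically with einsum. It is only an illustration, not the code from the PR: the arrays `a`, `b` and `c` are random stand-ins I made up (think of `c` as something like `valid_mask`), and the 5x5 shape is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins, not the actual arrays used in the PR.
a = rng.random((5, 5))                         # e.g. an image window
b = rng.random((5, 5))                         # e.g. the template
c = (rng.random((5, 5)) > 0.3).astype(float)   # e.g. a 0/1 validity mask

# Expanded form: c*a**2 + c*b**2 - 2*c*a*b, summed over all pixels.
expanded = (np.einsum('ij,ij->', c, a**2)
            + np.einsum('ij,ij->', c, b**2)
            - 2 * np.einsum('ij,ij,ij->', c, a, b))

# Factored form: c*(a-b)**2, summed with a single two-operand einsum.
factored = np.einsum('ij,ij->', c, (a - b)**2)

# The two forms should agree up to floating-point rounding.
assert np.isclose(expanded, factored)
```

The factored form needs the squared difference computed up front, but the einsum itself only takes two operands.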
However, this does add another call to einsum. I had run tests with about 500 pixels, which amounts to 1500 einsum calls, and the einsum calls turn out to be the bottleneck. (PR) This is why I refrained from adding another einsum call, since I'd have 2000 calls then. However, your tests indicate that calls with 2 parameters run considerably faster, so maybe I'll go ahead and make this change? What are your thoughts on this?