I am curious whether we could add simple blocked iteration to NumPy.
It seems like a long-standing small deficiency in NumPy that we do not
support blocked iteration. I do not know how large the speedup would
be in real-world code, but I assume some copies with a bad memory
order could be drastically faster.
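
To make "blocked iteration" concrete, here is a minimal sketch of a tiled transposed copy. It is plain Python over flat buffers with explicit index arithmetic, only to show the traversal order; the function name and the `block` parameter are made up for illustration and are not part of NumPy. Python itself will not show any speedup here; the point is that both the reads and the writes stay within a small tile, which is what would help caches in a C implementation.

```python
def blocked_transpose_copy(src, n_rows, n_cols, block=4):
    """Copy a row-major (n_rows x n_cols) buffer into a row-major
    (n_cols x n_rows) buffer holding its transpose, one tile at a time.

    Hypothetical helper for illustration only, not a NumPy function.
    """
    dst = [None] * (n_rows * n_cols)
    for i0 in range(0, n_rows, block):        # tile origin along rows
        for j0 in range(0, n_cols, block):    # tile origin along columns
            i1 = min(i0 + block, n_rows)      # clip the tile at the edges
            j1 = min(j0 + block, n_cols)
            for i in range(i0, i1):
                for j in range(j0, j1):
                    # Read is contiguous in j, write is contiguous in i;
                    # within one tile both stay inside a small window.
                    dst[j * n_rows + i] = src[i * n_cols + j]
    return dst


src = list(range(6 * 10))
out = blocked_transpose_copy(src, 6, 10, block=4)
```
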
Implementing blocked iteration for NumPy seems pretty complicated at
first sight, due to the complexity of the iterator and the fact that
almost no one knows the code well or touches it regularly.
But the actual complexity of adding a new iteration mode to it is
probably not prohibitively high.
First, we need to (quickly) detect whether blocked iteration makes
sense and, if it does, store whatever additional metadata is needed.
Second, we need to provide a newly implemented `iternext` function.
The first part seems like it can be done in its own function and
should be fairly straightforward once most of the iterator setup is
done.
The second part fits how the iterator is already designed.
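
The two parts could look roughly like the toy sketch below. Everything in it is an assumption for illustration: `wants_blocking` is a made-up setup-time check restricted to two 2-D operands, and `BlockedIter2D` is a stand-in whose `iternext` only mimics the "advance, then report whether anything is left" shape of an iterator step. It is not NumPy's iterator and says nothing about how the real metadata would be stored.

```python
def wants_blocking(strides_a, strides_b):
    """Hypothetical check for two 2-D operands: block only when the
    operands disagree on which axis has the smallest stride."""
    fastest_a = min(range(2), key=lambda ax: abs(strides_a[ax]))
    fastest_b = min(range(2), key=lambda ax: abs(strides_b[ax]))
    return fastest_a != fastest_b


class BlockedIter2D:
    """Toy blocked iteration over a non-empty 2-D shape (illustration only)."""

    def __init__(self, shape, block):
        self.shape, self.block = shape, block
        self.tile = [0, 0]    # origin of the current tile
        self.coord = [0, 0]   # current position

    def iternext(self):
        """Advance one element; return False once iteration is exhausted."""
        (n0, n1), b = self.shape, self.block
        i, j = self.coord
        t0, t1 = self.tile
        # Step inside the current tile first: inner axis, then outer axis.
        if j + 1 < min(t1 + b, n1):
            self.coord = [i, j + 1]
        elif i + 1 < min(t0 + b, n0):
            self.coord = [i + 1, t1]
        # Tile finished: move right to the next tile, then down a tile row.
        elif t1 + b < n1:
            self.tile = [t0, t1 + b]
            self.coord = [t0, t1 + b]
        elif t0 + b < n0:
            self.tile = [t0 + b, 0]
            self.coord = [t0 + b, 0]
        else:
            return False
        return True


# Strides in bytes for a C-ordered and a transposed float64 operand.
use_blocks = wants_blocking((80, 8), (8, 48))

it = BlockedIter2D((3, 5), block=2)
visited = [tuple(it.coord)]
while it.iternext():
    visited.append(tuple(it.coord))
```

The check runs once during setup, while all per-element work lives in `iternext`, which is the split described above.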
It would be helpful to have someone with some expertise/background in
this type of thing to be able to discuss trade-offs and quickly see
what the main goals should be and whether/where significant
performance increases are likely.
There are some things that may end up being complicated, for example
if we want to support reductions/broadcasting. However, it may well
be that there is no reason to attack those more complex cases, because
the largest gains are expected elsewhere in any case.
I would be extremely happy if anyone with the necessary background is
interested in giving this challenge a try. I can help with the NumPy
API side and with code review, and would be available for chatting.
I do not have the bandwidth to actually dive into this for real
though.
That covers the complexity on the NumPy API side. I do not know how
complex a blocked iterator itself is.