On Wed, Oct 5, 2016 at 5:36 PM, Robert McLeod <robbmcleod@gmail.com> wrote:

It's certainly true that numexpr doesn't create a lot of OP_COPY operations; rather, it's optimized to minimize them. It probably uses fewer ops than naive successive calls to numpy within python, but I'm unsure whether there's any difference in operation count between hand-optimized numpy with out= set and numexpr. Numexpr just does it for you.

That was my understanding as well. If it automatically does what one could achieve by carrying the state along in the 'out' parameter, that's as good as it can get in terms of removing unnecessary ops. There are other speedup opportunities, of course, but that's a separate matter.
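To make the comparison concrete, here is a small sketch (my own toy expression, not from the thread) of the naive numpy style, which allocates a temporary for every operator, versus the hand-optimized style that carries state in a preallocated buffer via out= -- the pattern numexpr automates:

```python
import numpy as np

a = np.arange(4, dtype=np.float64)
b = np.arange(4, dtype=np.float64)

# Naive: allocates temporaries for 2*a, for 3*b, and for the sum.
naive = 2 * a + 3 * b

# Hand-optimized: reuse one output buffer and one scratch buffer,
# so no hidden temporaries are allocated inside the loop of ops.
out = np.empty_like(a)
tmp = np.empty_like(b)
np.multiply(a, 2, out=out)   # out = 2*a
np.multiply(b, 3, out=tmp)   # tmp = 3*b
np.add(out, tmp, out=out)    # out = 2*a + 3*b, in place
```

With numexpr one would simply write numexpr.evaluate("2*a + 3*b") and get the equivalent temporary elimination without managing the buffers by hand.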
This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading:

Hadn't come across that one before. Great link, thanks. Using caches and vector registers well trumps threading, unless one has a lot of data, and it helps to disable hyper-threading.
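The blocking idea can be sketched in plain numpy. This is a toy illustration of the approach (the block size and the expression are my own choices, not numexpr's actual internals): the array is processed in cache-sized chunks, so the per-block temporaries stay resident in cache instead of streaming intermediate results through main memory:

```python
import numpy as np

def blocked_eval(a, b, block=4096):
    """Evaluate 2*a + 3*b one block at a time.

    Toy sketch of numexpr-style blocking: the scratch buffer is
    block-sized and reused, so intermediates fit in cache rather
    than being full-length temporaries in main memory.
    """
    out = np.empty_like(a)
    tmp = np.empty(block, dtype=a.dtype)   # reused scratch buffer
    for i in range(0, len(a), block):
        j = min(i + block, len(a))
        n = j - i
        np.multiply(a[i:j], 2, out=out[i:j])
        np.multiply(b[i:j], 3, out=tmp[:n])
        np.add(out[i:j], tmp[:n], out=out[i:j])
    return out
```

The single-threaded blocked loop already captures most of the win the post describes; threads only start to pay once the data is large enough to amortize their overhead.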