It's certainly true that numexpr doesn't create a lot of OP_COPY operations; it's optimized to minimize them. So it probably executes fewer ops than naive successive calls to NumPy from Python, but I'm unsure whether there's any difference in operation count between hand-optimized NumPy code with out= set and numexpr. Numexpr just does it for you.
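To make the comparison concrete, here is a rough sketch of the three styles for one arbitrary expression, 2*a + 3*b. The expression, the array sizes, and the function names `naive` and `hand_optimized` are made up for illustration. It only shows how the code differs; it says nothing about the actual operation counts inside numexpr.

```python
import numpy as np


def naive(a, b):
    # Each operator allocates a fresh temporary: 2*a, 3*b, then their sum.
    return 2 * a + 3 * b


def hand_optimized(a, b, out=None, scratch=None):
    # Hypothetical helper: reuse two buffers so no new temporaries are
    # allocated inside the expression itself.
    if out is None:
        out = np.empty_like(a)
    if scratch is None:
        scratch = np.empty_like(a)
    np.multiply(a, 2, out=out)      # out = 2*a
    np.multiply(b, 3, out=scratch)  # scratch = 3*b
    np.add(out, scratch, out=out)   # out = 2*a + 3*b
    return out


a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)

expected = naive(a, b)
result = hand_optimized(a, b)

# numexpr is a third-party package, so this part is optional here.
try:
    import numexpr as ne
    # Operands are passed explicitly rather than picked up from the caller's frame.
    ne_result = ne.evaluate("2 * a + 3 * b", local_dict={"a": a, "b": b})
except ImportError:
    ne_result = None
```

All three produce the same values. The difference is who manages the intermediate buffers: you do in the out= version, and numexpr does in the last one.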
This blog post from Tim Hochberg is useful for understanding the performance advantages of blocking versus multithreading: