On Mar 12, 2020, at 12:33, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:

And as I understand it (from a quick scan) the reason it can’t tell isn’t that the refcount isn’t 1 (which is something CPython could maybe fix if it were a problem, but it isn’t). Rather, it is already 1, but a refcount of 1 doesn’t actually prove that the value is a temporary.

This is at odds with my understanding, which is that the reason it can’t tell _is_ that the refcount isn’t 1. When you do `(b * c) + d`, `(b * c).__add__(d)` is passed `self` with a refcount of 1 (on the stack). I believe this hits the optimization today in numpy. When you do `bc + d`, `bc.__add__(d)` is passed `self` with a refcount of 2 (one on the stack, and one in `locals()`).
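
A minimal sketch of how one might observe the difference described above in CPython. `Probe` is a hypothetical stand-in class, not numpy’s implementation. The absolute counts reported by `sys.getrefcount` include extra references from the call itself and vary between CPython versions, and newer interpreters may report the same count for both cases, so only the relative comparison is meaningful:

```python
import sys


class Probe:
    """Hypothetical stand-in for an array; records the refcount __add__ sees."""

    seen = []

    def __mul__(self, other):
        return Probe()

    def __add__(self, other):
        # getrefcount adds a reference of its own, so compare the recorded
        # values against each other rather than against 1 or 2.
        Probe.seen.append(sys.getrefcount(self))
        return Probe()


def compare():
    b, c, d = Probe(), Probe(), Probe()
    Probe.seen.clear()
    (b * c) + d  # left operand is an unnamed temporary
    bc = b * c
    bc + d       # left operand is also bound to the local name `bc`
    temp_count, named_count = Probe.seen
    return temp_count, named_count


temp_count, named_count = compare()
print("temporary:", temp_count, "named:", named_count)
```

The named operand should never show a lower count than the temporary, since the local variable holds a reference in addition to whatever the evaluation stack holds.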

The GitHub issue you referenced is about optimizing this:

    r = -a + b * c**2

The explanation of the problem is that “Temporary arrays generates in expressions are expensive”. And the detail says:

> The tricky part is that via the C-API one can use the PyNumber_ directly
> and skip increasing the reference count when not needed.
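
To make the cost of temporaries in that expression concrete, here is a rough sketch using a toy class that counts how many objects the expression allocates. `Vec` is a hypothetical illustration only; it is not how numpy arrays are implemented:

```python
class Vec:
    """Toy stand-in for an array that counts every instance created."""

    created = 0

    def __init__(self, values):
        self.values = list(values)
        Vec.created += 1

    def __neg__(self):
        return Vec(-x for x in self.values)

    def __pow__(self, exponent):
        return Vec(x ** exponent for x in self.values)

    def __mul__(self, other):
        return Vec(x * y for x, y in zip(self.values, other.values))

    def __add__(self, other):
        return Vec(x + y for x, y in zip(self.values, other.values))


a, b, c = Vec([1, 2]), Vec([3, 4]), Vec([5, 6])
before = Vec.created
r = -a + b * c**2
allocations = Vec.created - before  # -a, c**2, b * (c**2), and the final sum
print(allocations)
```

Only the last of those allocations is the result bound to `r`; the others are intermediates that an in-place optimization could in principle reuse.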

In other words, in their example, in `b * c**2`, even though both `b` and the result of `c**2` pass into `type(b).__mul__` with a refcount of 1, they can’t actually know that it’s safe to treat either one as a temporary: if the caller of `__mul__` or anything higher up the stack is a C extension function, it could hold a still-alive-in-C `PyObject*` that refers to one of those values despite the refcount being 1.

That’s why they have to walk the stack to see if there are any C functions (besides numpy functions that they know play nice): if not, they can assume refcount==1 means temporary, but otherwise, they can’t.

My proposal of del is that `(del bc) + d` would enter `bc.__add__(d)` with `self` passed with a refcount of 1.

So your proposal doesn’t help their problem. At best, it gives them the same behavior they already have, which they still need to optimize.