I'm afraid this is going to be a long one, and I don't see any good way to cut down the quoted text either... Charles, I'm going to plead with you to read what I've just written and think about it. I'm trying to make the case as clear as I can. I think the case is actually extremely simple: the existing @name@_arrtype_new code is broken. On Thu, 17 Jul 2008, Charles R Harris wrote:
On Tue, Jul 15, 2008 at 9:28 AM, Michael Abbott
wrote:

--- numpy/core/src/scalartypes.inc.src   (revision 5411)
+++ numpy/core/src/scalartypes.inc.src   (working copy)
@@ -1925,19 +1925,30 @@
         goto finish;
     }
+    Py_XINCREF(typecode);
     arr = PyArray_FromAny(obj, typecode, 0, 0, FORCECAST, NULL);
-    if ((arr==NULL) || (PyArray_NDIM(arr) > 0)) return arr;
+    if ((arr==NULL) || (PyArray_NDIM(arr) > 0)) {
+        Py_XDECREF(typecode);
+        return arr;
+    }
     robj = PyArray_Return((PyArrayObject *)arr);

 finish:
-    if ((robj==NULL) || (robj->ob_type == type)) return robj;
+    if ((robj==NULL) || (robj->ob_type == type)) {
+        Py_XDECREF(typecode);
+        return robj;
+    }
     /* Need to allocate new type and copy data-area over */
     if (type->tp_itemsize) {
         itemsize = PyString_GET_SIZE(robj);
     }
     else itemsize = 0;
     obj = type->tp_alloc(type, itemsize);
-    if (obj == NULL) {Py_DECREF(robj); return NULL;}
+    if (obj == NULL) {
+        Py_XDECREF(typecode);
+        Py_DECREF(robj);
+        return NULL;
+    }
     if (typecode==NULL) typecode = PyArray_DescrFromType(PyArray_@TYPE@);
     dest = scalar_value(obj, typecode);
     src = scalar_value(robj, typecode);
     Py_DECREF(typecode);
Ahah. That DECREF balances the original PyArray_DescrFromType call, or maybe the later one ... and of course this has to happen on *ALL* return paths. If we now take a closer look at the patch, we can see that it's doing two separate things:
1. There's an extra Py_XINCREF to balance the ref count lost to PyArray_FromAny and ensure that typecode survives long enough;
2. Every early return path has an extra Py_XDECREF to balance the creation of typecode.
I rest my case for this patch.
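The shape of those two changes can be seen in miniature with a toy ownership model (pure Python, illustration only -- the dict below stands in for CPython's refcount field, and steals_ref plays the role of PyArray_FromAny):

```python
# Toy model of reference ownership (not CPython internals):
# counts[obj] stands in for an object's refcount field.
counts = {}

def new_ref():
    # Plays the role of PyArray_DescrFromType: caller now owns one reference.
    obj = object()
    counts[obj] = 1
    return obj

def incref(obj):
    counts[obj] += 1

def decref(obj):
    counts[obj] -= 1
    if counts[obj] == 0:
        del counts[obj]          # the object is destroyed

def steals_ref(obj):
    # Plays the role of PyArray_FromAny: consumes one caller-owned reference.
    decref(obj)

def routine(early_return):
    typecode = new_ref()         # we are responsible for one reference
    incref(typecode)             # the patch's Py_XINCREF: survive the theft
    steals_ref(typecode)         # the reference-stealing call
    if early_return:
        decref(typecode)         # the patch's Py_XDECREF on the early path
        return
    # ... typecode may be used safely here ...
    decref(typecode)             # the routine's final Py_DECREF

routine(early_return=True)
routine(early_return=False)
assert not counts                # every path balances: nothing leaked
```

Whichever path the routine takes, the single reference created at the top is discharged exactly once, which is the whole point of the patch.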
> I still haven't convinced myself of this.
Seriously? Please bear with me for a bit, then. Let me try to go back to basics. Reference counting discipline should be very simple (I posted some links in my Wed, 9 Jul 2008 08:36:27 message to this list). Let me try to capture it in the following principles:

1. Each routine is wholly responsible for the reference counts it creates. This responsibility can be fulfilled by:
   1.a Decrementing the reference count.
   1.b Returning the object to the caller (which accounts for exactly one reference count).
   1.c Explicitly placing the object in another reference counted structure (similarly, this accounts for one reference count, and of course creates some extra obligations which I won't discuss).
   1.d Passing the object to a routine which indulges in "reference count theft". In some sense this can be thought of as an instance of 1.c, and it is how PyList_SetItem and PyTuple_SetItem behave.

2. Reference counts are created by:
   2.a Calling a routine which returns an object.
   2.b Explicitly incrementing the reference count.

3. When the known reference count on an object reaches zero, the object must be treated as lost (and any further reference to it is undefined). In particular, a routine cannot make any assumptions about reference counts it has not created.

I need to make a bit of a digression on the subject of reference count theft. I am only aware of three routines that do this: PyList_SetItem, PyTuple_SetItem and PyArray_FromAny. The first two effectively assign the reference into a preexisting structure, so they can be regarded as instances of principle 1.c. The last is a disaster area. I know (by inspecting the code) that PyArray_FromAny may choose to erase its typecode argument (by decrementing the reference), but I can't tell (without digging deeper than I care to go) whether this only occurs on the NULL return branch. However, this is a bit of a red herring.
If we recognise that PyArray_FromAny steals the reference count, the remaining analysis goes through.

A further note on 1.c: when an object is placed in another reference counted structure, its reference count should normally be incremented -- case 1.c is really just an elision of this increment with the decrement required by obligation 1.a. This normal case can be seen halfway through PyArray_Scalar.

Ok, preliminaries over: now to look at the code. Here is a caricature of the routine @name@_arrtype_new with the important features isolated:

     1  PyArray_Descr *typecode = NULL;
     2  if (...) goto finish;                 // _WORK@work@
     3  typecode = PyArray_DescrFromType(...);
     4  if (...) { ...typecode...; goto finish; }
     5  arr = PyArray_FromAny(..., typecode, ...);
     6  if (...) return arr;
     7  finish:
     8  if (...) return robj;
     9  if (...) return NULL;
    10  if (typecode == NULL) typecode = PyArray_DescrFromType(...);
    11  ... scalar_value(..., typecode); ...
    12  Py_DECREF(typecode);

So let's go through this line by line.

1. We start with a NULL typecode. Fine.
2. A hidden goto in a macro. Sweet. Let's have more code like this.
3. Here's the reference count we're responsible for.
4. If obj is NULL we use the typecode ...
5. ... otherwise we pass it to PyArray_FromAny.
6. The first early return.
7. All paths (apart from 6) come together here.

So at this point let's take stock. typecode is in one of three states: NULL (path 2, or if creation at 3 failed), allocated with a single reference count (path 4), or lost (path 5). This is not good. LET ME EMPHASISE THIS: the state of the code at the finish label is dangerous and simply broken. The state at the finish label is indeterminate: typecode has either been lost by passing it to PyArray_FromAny (in which case we're not allowed to touch it again), or else it carries a reference count that we're still responsible for.
There seems to be a fantasy, expressed in a comment in a recent update to this routine, that PyArray_Scalar has stolen a reference, but fortunately a quick code inspection (of arrayobject.c) quickly refutes this frightening possibility.

So the only way to fix the problem at (7) is to unify the two non-NULL cases. One answer is to add a DECREF at (4), but we see at (11) that we still need typecode at (7) -- so the only solution is to add an extra INCREF just before (5). This then, sadly, means that we also need an extra DECREF at (6). PLEASE don't suggest moving the INCREF until after (6) -- at that point typecode is lost and may have been destroyed, and relying on any possibility to the contrary is a recipe for continued screw-ups.

The rest is easy. Once we've established the invariant that typecode is either NULL or holds exactly one reference count at (7), the two early returns at (8) and (9) unfortunately need to be augmented with DECREFs. And we're done.

Responses to your original comments:
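To make the invariant concrete, here is the caricature's ownership of typecode modelled in Python (a sketch, not the real C code; the line numbers in the comments refer to the caricature, and count stands for the references we own):

```python
# Sketch: how many references to typecode the routine owns at each point,
# along the two non-NULL paths of the caricature, with and without the fix.

def refs_owned_after_decref(reaches_fromany, patched):
    count = 1                      # (3) PyArray_DescrFromType: one reference
    if reaches_fromany:
        if patched:
            count += 1             # the fix: INCREF just before (5)
        count -= 1                 # (5) PyArray_FromAny steals one reference
    # (7) the finish label
    count -= 1                     # (12) the unconditional Py_DECREF
    return count                   # 0 = balanced; -1 = decremented a ref we
                                   #     did not own (undefined behaviour)

# Unpatched: the two paths disagree at (7), so (12) cannot be right for both.
assert refs_owned_after_decref(reaches_fromany=False, patched=False) == 0
assert refs_owned_after_decref(reaches_fromany=True,  patched=False) == -1

# Patched: both paths reach (7) owning exactly one reference, and (12) balances.
assert refs_owned_after_decref(reaches_fromany=False, patched=True) == 0
assert refs_owned_after_decref(reaches_fromany=True,  patched=True) == 0
```

(The model deliberately omits the early returns at (6), (8) and (9); in the patched code each of those carries its own Py_XDECREF, as the patch shows.)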
> By the time we hit finish, robj is NULL or holds a reference to typecode and the NULL case is taken care of up front.

robj has nothing to do with the lifetime management of typecode; the only issue is the early return. After the finish label, typecode is either NULL (no problem) or else holds a single reference count that needs to be accounted for.
> Later on, the reference to typecode might be decremented,

That *might* is at the heart of the problem. You can't be so cavalier about managing references.
> perhaps leaving robj crippled, but in that case robj itself is marked for deletion upon exit.

Please ignore robj in this discussion; it's beside the point.
> If the garbage collector can handle zero reference counts I think we are alright.

No, no, no. This has nothing to do with the garbage collector. If we screw up our reference counts here, the garbage collector isn't going to dig us out of the hole.
> I admit I haven't quite followed all the subroutines and macros, which descend into the hazy depths without the slightest bit of documentation, but at this point I'm inclined to leave things alone unless you have a test that shows a leak from this source.

Part of my point is that proper reference count discipline should not require any descent into subroutines (except for the very nasty case of reference theft, which I think is generally agreed to be a bad thing).
As for the test case, try this one (you'll need a debug build):

    import numpy
    import sys

    r = range(100)
    refs = sys.gettotalrefcount()
    for i in r:
        numpy.float32()
    print sys.gettotalrefcount() - refs

You should get one leak per float32() call.

I've just noticed, looking at the latest revision of the code, that somebody seems to be under the misapprehension that PyArray_Scalar steals reference counts. Fortunately this is not true. Maybe this is part of the confusion?
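For regression testing, the same check can be wrapped up so that it degrades gracefully on a non-debug interpreter (a sketch; the helper name is mine, and sys.gettotalrefcount exists only on interpreters built with --with-pydebug):

```python
import sys

def leaked_refs_per_call(fn, iterations=100):
    """Rough number of references leaked per call to fn, or None if this
    interpreter was not built with --with-pydebug."""
    if not hasattr(sys, "gettotalrefcount"):
        return None                       # normal build: cannot measure
    fn()                                  # warm up one-time caches first
    before = sys.gettotalrefcount()
    for _ in range(iterations):
        fn()
    after = sys.gettotalrefcount()
    return (after - before) / iterations

# A well-behaved callable should report roughly zero on a debug build,
# and None on an ordinary build.
result = leaked_refs_per_call(lambda: [])
assert result is None or abs(result) < 1
```

Running it with `lambda: numpy.float32()` against the unpatched code should report about one leaked reference per call, matching the test above.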