On Fri, Jul 18, 2008 at 5:15 AM, Michael Abbott <michael@araneidae.co.uk> wrote:

I'm afraid this is going to be a long one, and I don't see any good way to
cut down the quoted text either...

Charles, I'm going to plea with you to read what I've just written and
think about it. I'm trying to make the case as clear as I can. I think
the case is actually extremely simple: the existing @name@_arrtype_name
code is broken.

<snip>

Heh. As the macro is undefined after the code is generated, it should probably be moved into the code.
I would actually like to get rid of the ifdef's (almost everywhere), but that is a later stage of cleanup.

3. Here's the reference count we're responsible for.

Yep.

4. If obj is NULL we use the typecode
5. otherwise we pass it to PyArray_FromAny.
6. The first early return
7. All paths (apart from 6) come together here.

So at this point let's take stock. typecode is in one of three states:
NULL (path 2, or if creation failed), allocated with a single reference
count (path 4), or lost (path 5). This is not good.

It still has a single reference after 5 if PyArray_FromAny succeeded, that reference is held by arr and transferred to robj. If the transfer fails, the reference to arr is decremented and NULL returned by PyArray_Return. When arr is garbage collected the reference to typecode will be decremented.

LET ME EMPHASISE THIS: the state of the code at the finish label is
dangerous and simply broken.

The original state at the finish label is indeterminate: typecode has
either been lost by passing it to PyArray_FromAny (in which case we're not
allowed to touch it again), or else it has reference count that we're
still responsible for.

There seems to be a fantasy expressed in a comment in a recent update to
this routine that PyArray_Scalar has stolen a reference, but fortunately a
quick code inspection (of arrayobject.c) quickly refutes this frightening
possibility.

No, no, Pyarray_Scalar doesn't do anything to the reference count. Where did you see otherwise?

So, the only way to fix the problem at (7) is to unify the two non-NULL
cases. One answer is to add a DECREF at (4), but we see at (11) that we
still need typecode at (7) -- so the only solution is to add an extra
ADDREF just before (5). This then of course sadly means that we also need
an extra DECREF at (6).

PLEASE don't suggest moving the ADDREF until after (6) -- at this point
typecode is lost and may have been destroyed, and relying on any
possibility to the contrary is a recipe for continued screw ups.

The rest is easy. Once we've established the invariant that typecode is
either NULL or has a single reference count at (7) then the two early
returns at (8) and (9) unfortunately need to be augmented with DECREFs.

And we're done.

Responses to your original comments:

> By the time we hit finish, robj is NULL or holds a reference to typecode
> and the NULL case is taken care of up front.

robj has nothing to do with the lifetime management of typecode, the only
issue is the early return. After the finish label typecode is either NULL
(no problem) or else has a single reference count that needs to be
accounted for.

> Later on, the reference to typecode might be decremented,

That *might* is at the heart of the problem. You can't be so cavalier
about managing references.

> perhaps leaving robj crippled, but in that case robj itself is marked
> for deletion upon exit.

Please ignore robj in ths discussion, it's beside the point.

> If the garbage collector can handle zero reference counts I think
> we are alright.

No, no, no. This is nothing to do with the garbage collector. If we
screw up our reference counts here then the garbage collector isn't going
to dig us out of the hole.

The garbage collector destroys the object and should decrement all the references it holds. If that is not the case then there are bigger problems afoot. The finalizer for the object should hold the knowledge of what needs to be decremented.

> I admit I haven't quite followed all the subroutines and
> macros, which descend into the hazy depths without the slightest bit of
> documentation, but at this point I'm inclined to leave things alone unless
> you have a test that shows a leak from this source.

Part of my point is that proper reference count discipline should not
require any descent into subroutines (except for the very nasty case of
reference theft, which I think is generally agreed to be a bad thing).

Agreed. But that is not the code we are looking at. My personal schedule for this sort of cleanup/refactoring looks like this.

1) Format the code into readable C. (ongoing)
2) Document the functions so we know what they do.
3) Understand the code.
4) Fix up functions starting from the bottom layers.
5) Flatten the code -- the calls go too deep for my taste and make understanding difficult. My attempts to generate a call graph have all run into problems.

At that point consider the reference counting model.

There aren't many people working with the C code apart from myself and Travis (who wrote the original), so I want to encourage you to keep looking at the code and working with it. However, we can't just start from scratch and we have to be pretty conservative to keep from breaking things.

As for the test case, try this one (you'll need a debug build):

import numpy
import sys
refs = 0
r = range(100)
refs = sys.gettotalrefcount()
for i in r: float32()
print sys.gettotalrefcount() - refs

Simpler test case:

import sys, gc
import numpy as np

def main() :
    t = np.dtype(np.float32)
    print sys.getrefcount(t)
    for i in range(100) :
        np.float32()
        gc.collect()
    print sys.getrefcount(t)

if __name__ == "__main__" :
    main()

Result

$[charris@f8 ~]$ python debug.py
5
105

So there is a leak. The question is the proper fix. I want to take a closer look at PyArray_Return and also float32() and relations.

Chuck