Strange and hard to reproduce crash
Travis Oliphant
oliphant.travis at ieee.org
Mon Oct 23 19:05:53 EDT 2006
Fernando Perez wrote:
> On 10/23/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
>
>> Fernando Perez wrote:
>>
>>> Hi all,
>>>
>>> two colleagues have been seeing occasional crashes from very
>>> long-running code which uses numpy. We've now gotten a backtrace from
>>> one such crash, unfortunately it uses a build from a few days ago:
>>>
>>>
>> This looks like a reference-count problem on the data-type objects
>> (probably one of the builtin ones is trying to be released). The
>> reference count problem is probably hard to track down.
>>
>> A quick fix is to not allow the built-ins to be "freed" (the attempt
>> should never be made, but if it is, then we should just incref the
>> reference count and continue rather than die).
>>
>> Ideally, the reference count problem should be found, but otherwise
>> I'll just insert some print statements that fire if the attempt is
>> made and, as a safety measure, not actually free the object.
>>
>
> If you point me to the right place in the sources, I'll be happy to
> add something to my local copy, rebuild numpy and rerun with these
> print statements in place.
>
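I've placed them in SVN (r3384).
arraydescr_dealloc needs to do something like:
    if (self->fields == Py_None) {
        /* built-in data type: report the attempt, never free it */
        fprintf(stderr, "attempt to free built-in dtype %d (%c)\n",
                self->type_num, self->type);
        Py_INCREF(self);  /* undo the bad decref and keep going */
        return;
    }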
Most likely there is a missing Py_INCREF() before some call that uses
the data-type object (and consumes its reference count). Do you have
any Pyrex code? It's harder to get it right with Pyrex.
> I realize this is probably a very difficult problem to track down, but
> it really sucks to run a code for 4 days only to have it explode at
> the end. Right now this is starting to be a serious problem for us as
> we move our codes into large production runs, so I'm willing to put in
> the necessary effort to track it down, though I'll need some guidance
> from our gurus.
>
Tracking the reference count of the built-in data-type objects should
not be too difficult. First, figure out which one is causing problems
(if you still have the gdb traceback, then go up to the
arraydescr_dealloc function and look at self->type_num and self->type).
Then, put print statements throughout your code for the reference count
of this data-type object.
Something like
    sys.getrefcount(numpy.dtype('float'))
would be enough at a looping point in your code.
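To make that concrete, here is a minimal sketch of such a check. The
names refcount_drift and demo are hypothetical helpers, not part of
numpy, and one_iteration is only a stand-in for one pass of the real
loop. It assumes that numpy.dtype('float') hands back the shared
built-in descriptor, so that its count reflects every holder.

```python
import sys


def refcount_drift(obj, work, iterations=5):
    """Call work() repeatedly and record how obj's reference count moves.

    Returns one entry per iteration: the count after that iteration
    minus the count taken before the first one. A steadily falling
    sequence suggests that something is consuming references.
    """
    baseline = sys.getrefcount(obj)
    drift = []
    for _ in range(iterations):
        work()
        drift.append(sys.getrefcount(obj) - baseline)
    return drift


def demo():
    # Assumption: numpy is installed and numpy.dtype('float') returns
    # the shared built-in descriptor object on every call.
    import numpy

    descr = numpy.dtype('float')

    def one_iteration():
        # Hypothetical stand-in for one pass of the long-running loop.
        numpy.zeros(10, dtype=descr).sum()

    print(refcount_drift(descr, one_iteration))
```

Calling demo() prints the drift for the float descriptor. Only the
trend matters, not the absolute numbers: a value that keeps dropping
from one iteration to the next points at the code path that is missing
an incref.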
Good luck,
-Travis