Strange and hard to reproduce crash
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
Hi all,

two colleagues have been seeing occasional crashes from very long-running code which uses numpy. We've now gotten a backtrace from one such crash; unfortunately, it uses a build from a few days ago:

    In [3]: numpy.__version__
    Out[3]: '1.0b5.dev3097'

    In [4]: scipy.__version__
    Out[4]: '0.5.0.2180'

Because it takes so long to get the code to crash (several days of 100% CPU usage), I can't make a new one right now, but I'll be happy to restart the same run with a current SVN build if necessary, and post the results in a few days.

In the meantime, here's a gdb backtrace we were able to get by setting MALLOC_CHECK_ to 2 and running the python process from within gdb:

    Program received signal SIGABRT, Aborted.
    [Switching to Thread 1073880896 (LWP 26280)]
    0x40000402 in __kernel_vsyscall ()
    (gdb) bt
    #0 0x40000402 in __kernel_vsyscall ()
    #1 0x0042c7d5 in raise () from /lib/tls/libc.so.6
    #2 0x0042e149 in abort () from /lib/tls/libc.so.6
    #3 0x0046b665 in free_check () from /lib/tls/libc.so.6
    #4 0x00466e65 in free () from /lib/tls/libc.so.6
    #5 0x005a4ab7 in PyObject_Free () from /usr/lib/libpython2.3.so.1.0
    #6 0x403f6336 in arraydescr_dealloc (self=0x40424020) at arrayobject.c:10455
    #7 0x403fab3e in PyArray_FromArray (arr=0xe081cb0, newtype=0x40424020, flags=0) at arrayobject.c:7725
    #8 0x403facc3 in PyArray_FromAny (op=0xe081cb0, newtype=0x0, min_depth=0, max_depth=0, flags=0, context=0x0) at arrayobject.c:8178
    #9 0x4043bc45 in PyUFunc_GenericFunction (self=0x943a660, args=0xa9dbf2c, mps=0xbfc83730) at ufuncobject.c:906
    #10 0x40440a04 in ufunc_generic_call (self=0x943a660, args=0xa9dbf2c) at ufuncobject.c:2742
    #11 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
    #12 0x0057d6d4 in PyObject_CallFunction () from /usr/lib/libpython2.3.so.1.0
    #13 0x403eabb6 in PyArray_GenericBinaryFunction (m1=Variable "m1" is not available.
    ) at arrayobject.c:3296
    #14 0x0057b7e1 in PyNumber_Check () from /usr/lib/libpython2.3.so.1.0
    #15 0x0057c1e0 in PyNumber_Multiply () from /usr/lib/libpython2.3.so.1.0
    #16 0x005d16a3 in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #17 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
    #18 0x005d3d8f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #19 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
    #20 0x00590e2e in PyFunction_SetClosure () from /usr/lib/libpython2.3.so.1.0
    #21 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
    #22 0x00584d98 in PyMethod_New () from /usr/lib/libpython2.3.so.1.0
    #23 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
    #24 0x005b584c in _PyObject_SlotCompare () from /usr/lib/libpython2.3.so.1.0
    #25 0x005aec2c in PyType_IsSubtype () from /usr/lib/libpython2.3.so.1.0
    #26 0x0057d607 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
    #27 0x005d2b7f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #28 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
    #29 0x005d3d8f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #30 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
    #31 0x005d3d8f in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #32 0x005d497b in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #33 0x005d497b in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #34 0x005d497b in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
    #35 0x005d509e in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
    #36 0x005d5362 in PyEval_EvalCode () from /usr/lib/libpython2.3.so.1.0
    #37 0x005ee817 in PyErr_Display () from /usr/lib/libpython2.3.so.1.0
    #38 0x005ef942 in PyRun_SimpleFileExFlags () from /usr/lib/libpython2.3.so.1.0
    #39 0x005f0994 in PyRun_AnyFileExFlags () from /usr/lib/libpython2.3.so.1.0
    #40 0x005f568e in Py_Main () from /usr/lib/libpython2.3.so.1.0
    #41 0x080485b2 in main ()
    # End of BT.

This code is running on a Fedora Core 3 box, with python 2.3.4 and numpy/scipy compiled using gcc 3.4.4.

I realize that it's extremely difficult to help with so little information, but unfortunately we have no small test that can reproduce the problem. Only our large research codes, when running for multiple days on a single run, cause this. Even very intensive uses of the same code which last only a few hours never show this. This code is a long-running iterative algorithm, so it's basically applying the same (complex) loop over and over until convergence, using numpy and scipy pretty extensively throughout.

If super Travis (or anyone else) can have a Eureka moment from the above backtrace, that would be fantastic. If there's any other information you think I may be able to provide, I'll be happy to do my best.

Cheers,

f

-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
![](https://secure.gravatar.com/avatar/fa3279b202e9a85f7d90a0422bca4489.jpg?s=120&d=mm&r=g)
Hey Fernando

Maybe you can give the code a spin under Valgrind. It's going to be slow, but if the crash is being caused by memory corruption that happens all the time as the process is running, maybe Valgrind will show it.

You need some Valgrind suppressions for Python. It seems the 2.3 source tree didn't contain these yet, so try the one from trunk:

http://svn.python.org/view/python/trunk/Misc/valgrind-python.supp?rev=47113&view=auto

I then run Valgrind as follows:

    valgrind \
      --tool=memcheck \
      --leak-check=yes \
      --error-limit=no \
      --suppressions=valgrind-python.supp \
      --num-callers=20 \
      --freelist-vol=536870912 \
      -v \
      python foo.py

I recommend using the latest Valgrind (3.2.1) from here:

http://www.valgrind.org/downloads/current.html#current

A build from source should be as simple as ./configure && make.

Cheers,

Albert
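For anyone who wants to script such a run, the command above can be assembled from Python. This is only a sketch: the helper name `valgrind_argv` and its defaults are made up for illustration, and the flags are copied unchanged from the message above.

```python
import shlex


def valgrind_argv(script, suppressions="valgrind-python.supp", python="python"):
    """Build the memcheck invocation quoted above as an argv list.

    The function name and defaults are hypothetical; only the flags come
    from the message above.
    """
    return [
        "valgrind",
        "--tool=memcheck",
        "--leak-check=yes",
        "--error-limit=no",
        "--suppressions=%s" % suppressions,
        "--num-callers=20",
        "--freelist-vol=536870912",  # 512 * 1024 * 1024 bytes
        "-v",
        python,
        script,
    ]


argv = valgrind_argv("foo.py")
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)

# To actually launch the run (requires Valgrind to be installed):
#     import subprocess
#     subprocess.call(argv)
```

Keeping the arguments in a list avoids shell-quoting problems if the script path or suppression file contains spaces.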
-----Original Message----- From: numpy-discussion-bounces@lists.sourceforge.net [mailto:numpy- discussion-bounces@lists.sourceforge.net] On Behalf Of Fernando Perez Sent: Monday, October 23, 2006 11:40 PM To: Discussion of Numerical Python Subject: [Numpy-discussion] Strange and hard to reproduce crash
Hi all,
two colleagues have been seeing occasional crashes from very long-running code which uses numpy. We've now gotten a backtrace from one such crash, unfortunately it uses a build from a few days ago:
In [3]: numpy.__version__ Out[3]: '1.0b5.dev3097'
In [4]: scipy.__version__ Out[4]: '0.5.0.2180'
Because it takes so long to get the code to crash (several days of 100%CPU usage), I can't make a new one right now, but I'll be happy to restart the same run with a current SVN build if necessary, and post the results in a few days.
<snip>
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/23/06, Albert Strasheim <fullung@gmail.com> wrote:
Hey Fernando
Maybe you can give the code a spin under Valgrind. It's going to be slow, but if the crash is being caused by memory corruption that happens all the time as the process is running, maybe Valgrind will show it.
You need some Valgrind suppressions for Python. It seems the 2.3 source tree didn't contain these yet, so try the one from trunk:
[...]

Thanks, Albert. I can give it a try, though it will probably take ages to run. This already requires 3-4 days of non-stop execution to cause a crash, and valgrind can make execution times go up by a factor of 10. I'd like to have some info before a month :)

Cheers,

f
![](https://secure.gravatar.com/avatar/49df8cd4b1b6056c727778925f86147a.jpg?s=120&d=mm&r=g)
Fernando Perez wrote:
Hi all,
two colleagues have been seeing occasional crashes from very long-running code which uses numpy. We've now gotten a backtrace from one such crash, unfortunately it uses a build from a few days ago:
This looks like a reference-count problem on the data-type objects (probably one of the builtin ones is trying to be released). The reference count problem is probably hard to track down.

A quick fix is to not allow the built-ins to be "freed" (the attempt should never be made, but if it is, then we should just incref the reference count and continue rather than die).

Ideally, the reference count problem should be found, but otherwise I'll just insert some print statements if the attempt is made, but not actually do it as a safety measure.

-Travis
![](https://secure.gravatar.com/avatar/49df8cd4b1b6056c727778925f86147a.jpg?s=120&d=mm&r=g)
The long awaited day is coming....--- Wednesday is the target.

Please submit problems before Tuesday (tomorrow). Nothing but bug-fixes are being changed right now.

-Travis
![](https://secure.gravatar.com/avatar/fa3279b202e9a85f7d90a0422bca4489.jpg?s=120&d=mm&r=g)
Hey Travis
-----Original Message----- From: numpy-discussion-bounces@lists.sourceforge.net [mailto:numpy- discussion-bounces@lists.sourceforge.net] On Behalf Of Travis Oliphant Sent: Tuesday, October 24, 2006 12:32 AM To: Discussion of Numerical Python Subject: [Numpy-discussion] Release of 1.0 coming
The long awaited day is coming....--- Wednesday is the target.
Please submit problems before Tuesday (tomorrow). Nothing but bug-fixes are being changed right now.
Some Valgrind warnings that you might want to look at:
http://projects.scipy.org/scipy/numpy/ticket/360

Maybe faltet could provide some code to reproduce this problem:
http://projects.scipy.org/scipy/numpy/ticket/355

I think this ndpointer issue has been resolved (Stefan?):
http://projects.scipy.org/scipy/numpy/ticket/340

I think ctypes 1.0.1 is required for ndpointer to work, so we might consider some kind of version check + warning on import?

Maybe a Python at-exit handler can be used to avoid the add_docstring leaks described here:
http://projects.scipy.org/scipy/numpy/ticket/195

Also, what's the story with f2py? It seems Pearu is still making quite a few changes in the trunk as part of F2PY G3.

Cheers,

Albert
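The "version check + warning on import" idea could look roughly like the sketch below. It is not what numpy does: the helper names `parse_version` and `warn_if_older` are hypothetical, and the version strings are passed in explicitly rather than read from the installed ctypes, so the sketch does not depend on how a given ctypes release exposes its version.

```python
import warnings


def parse_version(text):
    """Turn a dotted version string such as '1.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def warn_if_older(found, required="1.0.1", package="ctypes"):
    """Warn and return False when `found` is older than `required`."""
    if parse_version(found) < parse_version(required):
        warnings.warn(
            "%s %s found, but %s or newer is recommended for ndpointer"
            % (package, found, required)
        )
        return False
    return True


# Demonstration with the two versions mentioned in the thread.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_ok = warn_if_older("1.0.0")
    new_ok = warn_if_older("1.0.1")

print(old_ok, new_ok, len(caught))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as '1.0.10' sorting before '1.0.9'.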
![](https://secure.gravatar.com/avatar/49df8cd4b1b6056c727778925f86147a.jpg?s=120&d=mm&r=g)
Albert Strasheim wrote:
Hey Travis
-----Original Message----- From: numpy-discussion-bounces@lists.sourceforge.net [mailto:numpy- discussion-bounces@lists.sourceforge.net] On Behalf Of Travis Oliphant Sent: Tuesday, October 24, 2006 12:32 AM To: Discussion of Numerical Python Subject: [Numpy-discussion] Release of 1.0 coming
The long awaited day is coming....--- Wednesday is the target.
Please submit problems before Tuesday (tomorrow). Nothing but bug-fixes are being changed right now.
Some Valgrind warnings that you might want to look at: http://projects.scipy.org/scipy/numpy/ticket/360
fixed.
Maybe faltet could provide some code to reproduce this problem: http://projects.scipy.org/scipy/numpy/ticket/355
Looked at it and couldn't see what could be wrong. Need code to reproduce the problem.
I think this ndpointer issue has been resolved (Stefan?): http://projects.scipy.org/scipy/numpy/ticket/340
Yes it has. Fixed.
I think ctypes 1.0.1 is required for ndpointer to work, so we might consider some kind of version check + warning on import?
Not sure about that. It worked for me using ctypes 1.0.0.
Maybe a Python at-exit handler can be used to avoid the add_docstring leaks described here: http://projects.scipy.org/scipy/numpy/ticket/195
I'm not too concerned about this. Whether we release the memory right before exiting or just let the O.S. do it when the process quits seems rather immaterial. It would be a bit of work to implement, so the cost / benefit ratio seems way too high.
Also, what's the story with f2py? It seems Pearu is still making quite a few changes in the trunk as part of F2PY G3.
Pearu told me not to hold up NumPy 1.0 because f2py g3 is still a ways away. His changes should not impact normal usage of f2py. I suspect NumPy 1.0.1 will contain f2py g3.

-Travis
![](https://secure.gravatar.com/avatar/af6c39d6943bd4b0e1fde23161e7bb8c.jpg?s=120&d=mm&r=g)
On Mon, Oct 23, 2006 at 05:28:05PM -0600, Travis Oliphant wrote:
Yes it has. Fixed.
I think ctypes 1.0.1 is required for ndpointer to work, so we might consider some kind of version check + warning on import?
Not sure about that. It worked for me using ctypes 1.0.0.
You have to exercise ctypes beyond the normal unit tests for it to break (my code did, the moment the update went into numpy). I can confirm that it runs fine with ctypes 1.0.1.

Regards

Stéfan
![](https://secure.gravatar.com/avatar/9820b5956634e5bbad7f4ed91a232822.jpg?s=120&d=mm&r=g)
Albert Strasheim wrote:
Hey Travis
-----Original Message----- From: numpy-discussion-bounces@lists.sourceforge.net [mailto:numpy- discussion-bounces@lists.sourceforge.net] On Behalf Of Travis Oliphant Sent: Tuesday, October 24, 2006 12:32 AM To: Discussion of Numerical Python Subject: [Numpy-discussion] Release of 1.0 coming
The long awaited day is coming....--- Wednesday is the target.
Please submit problems before Tuesday (tomorrow). Nothing but bug-fixes are being changed right now.
Some Valgrind warnings that you might want to look at: http://projects.scipy.org/scipy/numpy/ticket/360
Maybe faltet could provide some code to reproduce this problem: http://projects.scipy.org/scipy/numpy/ticket/355
I think this ndpointer issue has been resolved (Stefan?): http://projects.scipy.org/scipy/numpy/ticket/340
I think ctypes 1.0.1 is required for ndpointer to work, so we might consider some kind of version check + warning on import?
Yes, please, I got caught on this one: ctypes code not running anymore with SVN numpy. Updating ctypes from 1.0.0 to 1.0.1 did the trick.

cheers,

David
Maybe a Python at-exit handler can be used to avoid the add_docstring leaks described here: http://projects.scipy.org/scipy/numpy/ticket/195
Also, what's the story with f2py? It seems Pearu is still making quite a few changes in the trunk as part of F2PY G3.
Cheers,
Albert
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
Fernando Perez wrote:
Hi all,
two colleagues have been seeing occasional crashes from very long-running code which uses numpy. We've now gotten a backtrace from one such crash, unfortunately it uses a build from a few days ago:
This looks like a reference-count problem on the data-type objects (probably one of the builtin ones is trying to be released). The reference count problem is probably hard to track down.
A quick fix is to not allow the built-ins to be "freed" (the attempt should never be made, but if it is, then we should just incref the reference count and continue rather than die).
Ideally, the reference count problem should be found, but other-wise I'll just insert some print statements if the attempt is made, but not actually do it as a safety measure.
If you point me to the right place in the sources, I'll be happy to add something to my local copy, rebuild numpy and rerun with these print statements in place.

I realize this is probably a very difficult problem to track down, but it really sucks to run a code for 4 days only to have it explode at the end. Right now this is starting to be a serious problem for us as we move our codes into large production runs, so I'm willing to put in the necessary effort to track it down, though I'll need some guidance from our gurus.

Cheers,

f
![](https://secure.gravatar.com/avatar/49df8cd4b1b6056c727778925f86147a.jpg?s=120&d=mm&r=g)
Fernando Perez wrote:
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
Fernando Perez wrote:
Hi all,
two colleagues have been seeing occasional crashes from very long-running code which uses numpy. We've now gotten a backtrace from one such crash, unfortunately it uses a build from a few days ago:
This looks like a reference-count problem on the data-type objects (probably one of the builtin ones is trying to be released). The reference count problem is probably hard to track down.
A quick fix is to not allow the built-ins to be "freed" (the attempt should never be made, but if it is, then we should just incref the reference count and continue rather than die).
Ideally, the reference count problem should be found, but other-wise I'll just insert some print statements if the attempt is made, but not actually do it as a safety measure.
If you point me to the right place in the sources, I'll be happy to add something to my local copy, rebuild numpy and rerun with these print statements in place.
I've placed them in SVN (r3384):

arraydescr_dealloc needs to do something like:

    if (self->fields == Py_None) {
        print something
        incref(self)
        return;
    }

Most likely there is a missing Py_INCREF() before some call that uses the data-type object (and consumes its reference count) --- do you have any Pyrex code? (It's harder to get it right with Pyrex.)
I realize this is probably a very difficult problem to track down, but it really sucks to run a code for 4 days only to have it explode at the end. Right now this is starting to be a serious problem for us as we move our codes into large production runs, so I'm willing to put in the necessary effort to track it down, though I'll need some guidance from our gurus.
Tracking the reference count of the built-in data-type objects should not be too difficult. First, figure out which one is causing problems (if you still have the gdb traceback, then go up to the arraydescr_dealloc function and look at self->type_num and self->type).

Then, put print statements throughout your code for the reference count of this data-type object. Something like,

    sys.getrefcount(numpy.dtype('float'))

would be enough at a looping point in your code.

Good luck,

-Travis
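A minimal sketch of that kind of per-iteration logging follows. To stay self-contained it tracks a plain `object()` as a stand-in; in the situation discussed here the tracked object would be `numpy.dtype('float')`. The helper name `make_refcount_logger` and the deliberately leaky loop are invented for illustration.

```python
import sys


def make_refcount_logger(obj, log):
    """Return a function that appends (label, refcount of obj) to log."""
    def record(label):
        log.append((label, sys.getrefcount(obj)))
    return record


# Stand-in object (assumption); the thread would track numpy.dtype('float').
tracked = object()
log = []
record = make_refcount_logger(tracked, log)

leaked = []  # simulates extension code that never releases its reference
for i in range(5):
    record("iteration %d" % i)
    leaked.append(tracked)  # one extra reference per pass through the loop

counts = [count for _, count in log]
deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
print(deltas)
```

The absolute numbers reported by `sys.getrefcount` include temporary references, so the useful signal is the difference between successive samples: a count that grows by a steady amount on every pass points at a leak inside the loop body.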
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
Fernando Perez wrote:
If you point me to the right place in the sources, I'll be happy to add something to my local copy, rebuild numpy and rerun with these print statements in place.
I've placed them in SVN (r3384):
[...]

Great, thanks. I'll rebuild everything from SVN.
Tracking the reference count of the built-in data-type objects should not be too difficult. First, figure out which one is causing problems (if you still have the gdb traceback, then go up to the arraydescr_dealloc function and look at self->type_num and self->type).
Unfortunately we closed that gdb session.
Then, put print statements throughout your code for the reference count of this data-type object.
Something like,
sys.getrefcount(numpy.dtype('float'))
OK, we'll log those into a file and will report after another multi-day run.

Thanks again for the help!

Cheers,

f
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
I've placed them in SVN (r3384):
arraydescr_dealloc needs to do something like.
if (self->fields == Py_None) { print something incref(self) return; }
Here is some more info. We left a long-running job over the weekend with the prints you suggested. Oddly, something happened at the OS level which killed our SSH connection to that machine, but the above numpy dealloc() warning never printed (we logged this).

What did happen is that the refcount you suggested we print:

    sys.getrefcount(numpy.dtype('float'))

eventually seems to have wrapped around and gone negative. I'm attaching the log file with those print statements; the key point is that this happens eventually:

    PSVD Iteration 19 Ref count 1989827662
    bar 444
    PSVD Iteration 0 Ref count 2021353399
    PSVD Iteration 1 Ref count 2143386207
    PSVD Iteration 2 Ref count -2001245193
    PSVD Iteration 3 Ref count -1915816437
    PSVD Iteration 4 Ref count -1902698473

That refcount is for dtype('float') as indicated above. Is it not a problem that this particular refcount goes negative? Eventually it may continue increasing and hit zero, at which point I imagine that the bad dealloc will occur.

Are refcounts stored in signed 32-bit ints? Why? I'd have naively expected them to be stored in unsigned longs to avoid wraparound problems, but maybe I'm completely missing the real problem here.

We've started another run to see if we can get the actual crash to happen, will report.

Cheers,

f
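The jump from a large positive count to a large negative one is what a signed 32-bit counter does when it is incremented past 2**31 - 1. The sketch below models that in plain Python and "unwraps" the logged values, assuming the counter only ever grows and grows by less than 2**32 between samples. The helper names are made up; the numbers are the ones from the log excerpt above.

```python
def to_int32(n):
    """Interpret the low 32 bits of n as a signed two's-complement value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def unwrap_int32(samples):
    """Recover a growing true count from wrapped signed 32-bit samples.

    Assumes the underlying count never decreases and advances by less
    than 2**32 between consecutive samples.
    """
    unwrapped = []
    total = None
    for value in samples:
        if total is None:
            total = value
        else:
            # Forward distance from the previous sample, modulo 2**32.
            total += (value - to_int32(total)) & 0xFFFFFFFF
        unwrapped.append(total)
    return unwrapped


logged = [1989827662, 2021353399, 2143386207,
          -2001245193, -1915816437, -1902698473]
true_counts = unwrap_int32(logged)
for shown, actual in zip(logged, true_counts):
    print(shown, actual)
```

Under those assumptions the count keeps climbing straight through the sign change, which fits a counter that is only ever incremented. If the growth continued, the wrapped value would climb back toward zero, which is the scenario the message above worries about.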
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 10/30/06, Fernando Perez <fperez.net@gmail.com> wrote:
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
I've placed them in SVN (r3384):
arraydescr_dealloc needs to do something like.
if (self->fields == Py_None) { print something incref(self) return; }
Here is some more info. We left a long-running job over the weekend with the prints you suggested. Oddly, something happened at the OS level which killed our SSH connection to that machine, but the above numpy dealloc() warning never printed (we logged this).
What did happen is that the refcount you suggested we print:
sys.getrefcount(numpy.dtype('float'))
eventually seems to have wrapped around and gone negative. I'm attaching the log file with those print statements, the key point is that this happens eventually:
PSVD Iteration 19 Ref count 1989827662 bar 444 PSVD Iteration 0 Ref count 2021353399 PSVD Iteration 1 Ref count 2143386207 PSVD Iteration 2 Ref count -2001245193 PSVD Iteration 3 Ref count -1915816437 PSVD Iteration 4 Ref count -1902698473
That refcount is for dtype('float') as indicated above. Is it not a problem that this particular refcount goes negative? Eventually it may continue increasing and hit a zero, point at which I imagine that the bad dealloc will occur.
Are refcounts stored in signed 32-bit ints? Why? I'd have naively expected them to be stored in unsigned longs to avoid wraparound problems, but maybe I'm completely missing the real problem here.
I suspect the real problem is that the refcount keeps going up. Even if it was unsigned it would eventually wrap to zero and with a bit of luck get garbage collected. So probably something isn't decrementing the refcount.

Chuck
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/30/06, Charles R Harris <charlesr.harris@gmail.com> wrote:
I suspect the real problem is that the refcount keeps going up. Even if it was unsigned it would eventually wrap to zero and with a bit of luck get garbage collected. So probably something isn't decrementing the refcount.
Oops, my bad: I meant *unsigned long long*, so that the refcount is a 64-bit object. By the time it wraps around, you'll have run out of memory long ago. Having 32 bit ref counters can potentially mean you run out of the counter before you run out of RAM on a system with sufficient memory.

Cheers,

f
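A rough back-of-the-envelope check of that argument, assuming every genuine reference occupies at least one pointer-sized slot somewhere in memory (real containers need more than that, so these are lower bounds):

```python
def bytes_for_references(n_refs, pointer_size):
    """Lower bound on memory needed to hold n_refs genuine references."""
    return n_refs * pointer_size


GIB = 2 ** 30  # bytes per gibibyte
EIB = 2 ** 60  # bytes per exbibyte

# References needed to push a signed counter past its maximum value.
refs_to_overflow_32 = 2 ** 31
refs_to_overflow_64 = 2 ** 63

print(bytes_for_references(refs_to_overflow_32, 4) // GIB, "GiB with 4-byte pointers")
print(bytes_for_references(refs_to_overflow_32, 8) // GIB, "GiB with 8-byte pointers")
print(bytes_for_references(refs_to_overflow_64, 8) // EIB, "EiB with 8-byte pointers")
```

So a 32-bit counter can only be overflowed by genuine references on a machine with many gigabytes of memory, while a 64-bit counter is out of reach of any real machine. A count that overflows without the memory use to match points at references being counted but never released.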
![](https://secure.gravatar.com/avatar/ab7e74f2443b81e5175638d72be65e07.jpg?s=120&d=mm&r=g)
On 30/10/06, Fernando Perez <fperez.net@gmail.com> wrote:
On 10/30/06, Charles R Harris <charlesr.harris@gmail.com> wrote:
I suspect the real problem is that the refcount keeps going up. Even if it was unsigned it would eventually wrap to zero and with a bit of luck get garbage collected. So probably something isn't decrementing the refcount.
Oops, my bad: I meant *unsigned long long*, so that the refcount is a 64-bit object. By the time it wraps around, you'll have run out of memory long ago. Having 32 bit ref counters can potentially mean you run out of the counter before you run out of RAM on a system with sufficient memory.
Yes, this is a feature(?) of python as it currently stands (I checked 2.5) - reference counts are 32-bit signed integers, so if you have an object that has enough references, python will be exceedingly unhappy:

http://mail.python.org/pipermail/python-dev/2002-September/028679.html

It is of course possible that you actually have that many references to some object, but it seems to me you'd notice twenty-four gigabytes of pointers floating around...

A. M. Archibald
![](https://secure.gravatar.com/avatar/5f37aff3d274a0effbf20be82804d012.jpg?s=120&d=mm&r=g)
On 10/30/06, Fernando Perez <fperez.net@gmail.com> wrote:
On 10/30/06, Charles R Harris <charlesr.harris@gmail.com> wrote:
I suspect the real problem is that the refcount keeps going up. Even if it was unsigned it would eventually wrap to zero and with a bit of luck get garbage collected. So probably something isn't decrementing the refcount.
Oops, my bad: I meant *unsigned long long*, so that the refcount is a 64-bit object. By the time it wraps around, you'll have run out of memory long ago. Having 32 bit ref counters can potentially mean you run out of the counter before you run out of RAM on a system with sufficient memory.
Cheers,
FYI, this is what is defined in Include/object.h

    /* PyObject_HEAD defines the initial segment of every PyObject. */
    #define PyObject_HEAD                   \
        _PyObject_HEAD_EXTRA                \
        Py_ssize_t ob_refcnt;               \
        struct _typeobject *ob_type;

    #define Py_INCREF(op) (                         \
        _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
        (op)->ob_refcnt++)

    #define Py_DECREF(op)                                   \
        if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA           \
            --(op)->ob_refcnt != 0)                         \
            _Py_CHECK_REFCNT(op)                            \
        else                                                \
            _Py_Dealloc((PyObject *)(op))

And the '_Py_CHECK_REFCNT' macro will finally call Py_FatalError.

'ob_refcnt' is a Py_ssize_t integer, so I think you will not be able to overflow it, except in the case of C code with refcounting bugs. Am I right?

--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
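To make the bookkeeping concrete, here is a toy Python model of the same logic: a decrement that reaches zero triggers deallocation, and a guard like the one described earlier in the thread refuses to free built-in descriptors. It illustrates the mechanism only and is not numpy's or CPython's code; the class `Descr`, the `builtin` flag (standing in for the `self->fields == Py_None` test) and all function names are invented for this sketch.

```python
class Descr:
    """Toy stand-in for a C object carrying an explicit reference count."""

    def __init__(self, name, builtin=False):
        self.name = name
        self.builtin = builtin  # hypothetical flag marking built-in descriptors
        self.refcnt = 1         # the creator holds one reference
        self.freed = False


def incref(obj):
    obj.refcnt += 1


def dealloc(obj, messages):
    if obj.builtin:
        # Guard: report the attempt, restore one reference, do not free.
        messages.append("attempt to free builtin %s" % obj.name)
        incref(obj)
        return
    obj.freed = True


def decref(obj, messages):
    obj.refcnt -= 1
    if obj.refcnt == 0:
        dealloc(obj, messages)


def steals_reference(obj, messages):
    """Models a call that consumes one reference to its argument."""
    decref(obj, messages)


messages = []

# Buggy caller: hands over its only reference without an incref first.
float_descr = Descr("float", builtin=True)
steals_reference(float_descr, messages)

# Correct caller: takes an extra reference before the stealing call.
int_descr = Descr("int", builtin=True)
incref(int_descr)
steals_reference(int_descr, messages)

print(messages, float_descr.refcnt, int_descr.refcnt)
```

In the buggy path the guard records the attempt and leaves the object alive; without the guard the object would be freed while other code still points at it. The guard only covers the case where a reference is consumed without a matching increment. A count that keeps growing comes from the opposite mistake, an increment that is never matched by a decrement.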
![](https://secure.gravatar.com/avatar/5f37aff3d274a0effbf20be82804d012.jpg?s=120&d=mm&r=g)
On 10/30/06, Lisandro Dalcin <dalcinl@gmail.com> wrote:
FYI, this is what is defined in Include/object.h
I forgot to say: this is in Python 2.5.

--
Lisandro Dalcín
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/30/06, Lisandro Dalcin <dalcinl@gmail.com> wrote:
FYI, this is what is defined in Include/object.h
    /* PyObject_HEAD defines the initial segment of every PyObject. */
    #define PyObject_HEAD                   \
        _PyObject_HEAD_EXTRA                \
        Py_ssize_t ob_refcnt;               \
        struct _typeobject *ob_type;

    #define Py_INCREF(op) (                         \
        _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
        (op)->ob_refcnt++)

    #define Py_DECREF(op)                                   \
        if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA           \
            --(op)->ob_refcnt != 0)                         \
            _Py_CHECK_REFCNT(op)                            \
        else                                                \
            _Py_Dealloc((PyObject *)(op))

And the '_Py_CHECK_REFCNT' macro will finally call Py_FatalError.

'ob_refcnt' is a Py_ssize_t integer, so I think you will not be able to overflow it, except in the case of C code with refcounting bugs. Am I right?
I think you are, and fortunately this indicates that they /did/ change to a longer data type for refcounting in newer Pythons. The box where we have this problem is running 2.3 though, and obviously a runaway refcount in f2py can still cause a crash even if the counter is a longer data type. However, Travis mentioned he just fixed precisely a bug like that in f2py, so I'm optimistic, and I'm currently making a new test.

Thanks for the info,

f
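To make the counter-width argument concrete, here is a small Python sketch. It is not how CPython stores `ob_refcnt`; it only models a signed 32-bit counter with `ctypes.c_int32` and estimates how long a steady leak would take to wrap it. The leak rate of 1000 stray increments per second is an assumed figure chosen purely for illustration, and the helper names are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def wrapped_int32(value):
    """Return value as a signed 32-bit integer would hold it (two's-complement wrap)."""
    return ctypes.c_int32(value).value


def seconds_to_overflow(counter_max, leaks_per_second):
    """Seconds of steady leaking needed to reach counter_max."""
    return counter_max / leaks_per_second


# One increment past the maximum wraps to the most negative value.
after_overflow = wrapped_int32(INT32_MAX + 1)

LEAKS_PER_SECOND = 1000  # assumed rate, for illustration only
days_32 = seconds_to_overflow(INT32_MAX, LEAKS_PER_SECOND) / 86400
years_64 = seconds_to_overflow(INT64_MAX, LEAKS_PER_SECOND) / (86400 * 365)

print("32-bit counter after overflow:", after_overflow)
print("days to wrap a 32-bit counter:", round(days_32, 1))
print("years to wrap a 64-bit counter: %.2e" % years_64)
```

At the assumed rate a 32-bit counter wraps in a little under 25 days, while a 64-bit counter would not wrap in any realistic run. Note that Py_ssize_t is pointer-sized, so on a 32-bit build it is still a 32-bit counter.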
![](https://secure.gravatar.com/avatar/d5321459a9b36ca748932987de93e083.jpg?s=120&d=mm&r=g)
Fernando Perez wrote:
Here is some more info. We left a long-running job over the weekend with the prints you suggested. Oddly, something happened at the OS level which killed our SSH connection to that machine, but the above numpy dealloc() warning never printed (we logged this).

As an aside, I always use GNU screen when starting long-running jobs just in case something like this happens. screen lets you reconnect to a session from any login to a machine. Just a point of information...
-Andrew
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
I've placed them in SVN (r3384):
arraydescr_dealloc needs to do something like:

    if (self->fields == Py_None) {
        print something
        incref(self)
        return;
    }

Most likely there is a missing Py_INCREF() before some call that uses the data-type object (and consumes its reference count). Do you have any Pyrex code? (It's harder to get it right with Pyrex.)
OK, we've completed another long run (several days), and this time it didn't crash. But I think there are still refcount problems. I'm attaching the full log file and a plot of the refcount. It's wrapping around, and after some point the increase switches to a perfectly linear pattern. I'm not exactly sure why (it could be a change in the underlying code after the initialization phase; it's not my code, so I don't know its internals).

I hope this helps. It would be nice to track this down before 1.0.1 is out.

Cheers,

f

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion
![](https://secure.gravatar.com/avatar/95198572b00e5fbcd97fb5315215bf7a.jpg?s=120&d=mm&r=g)
On 10/23/06, Travis Oliphant <oliphant.travis@ieee.org> wrote:
I've placed them in SVN (r3384):
arraydescr_dealloc needs to do something like:

    if (self->fields == Py_None) {
        print something
        incref(self)
        return;
    }
Travis, I know you're busy right now, so this message is just so that the archives have this info for whenever you revisit the problem. A long run of our code is now producing the following output:

*** Reference count error detected: an attempt was made to deallocate 12 (d) ***
*** Reference count error detected: an attempt was made to deallocate 12 (d) ***

etc.

Thanks to your changes it does not crash anymore, so it's not a big deal for us. Whenever you want further details, I can try to collect them.

Regards,

f
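The behaviour described here (warn instead of crash) can be illustrated with a toy Python model of the guard Travis outlined. This is only a sketch of the idea, not numpy's C implementation: `ToyBuiltinDescr` and `buggy_consumer` are hypothetical names, and reading "12 (d)" as a type number plus a type character is an assumption based on the message format.

```python
class ToyBuiltinDescr:
    """Toy stand-in for a built-in dtype descriptor that must never be freed."""

    def __init__(self, type_num, char):
        self.type_num = type_num
        self.char = char
        self.refcnt = 1       # the reference owned by the library itself
        self.freed = False
        self.warnings = []

    def incref(self):
        self.refcnt += 1

    def decref(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self._dealloc()

    def _dealloc(self):
        # Guard: record the bookkeeping error and resurrect the object
        # instead of freeing memory that was never meant to be released.
        self.warnings.append(
            "*** Reference count error detected: an attempt was made "
            "to deallocate %d (%s) ***" % (self.type_num, self.char)
        )
        self.incref()


def buggy_consumer(descr):
    """Hypothetical caller that consumes a reference it was never given."""
    descr.decref()


double_descr = ToyBuiltinDescr(12, "d")
for _ in range(3):
    buggy_consumer(double_descr)

for line in double_descr.warnings:
    print(line)
```

Each unbalanced decrement produces one warning and leaves the descriptor alive, which matches the repeated messages in the log: the underlying missing increment is still there, but it no longer triggers a free of the built-in object.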
![](https://secure.gravatar.com/avatar/dd7980fbcb8033634af99fca18f97062.jpg?s=120&d=mm&r=g)
Hey, any chance it has something to do with running an up-to-date numpy with an "out-of-date" Python (2.3.4 is pretty old, isn't it)?

DG
participants (10)

- A. M. Archibald
- Albert Strasheim
- Andrew Straw
- Charles R Harris
- David Cournapeau
- David Goldsmith
- Fernando Perez
- Lisandro Dalcin
- Stefan van der Walt
- Travis Oliphant