How to debug reference counting errors
Hi,
There is a segfault reported here:
http://projects.scipy.org/numpy/ticket/1588
I've managed to isolate the problem and even provide a simple patch that fixes it here:
https://github.com/numpy/numpy/issues/398
However, the patch simply doesn't decrease the proper reference, so it might leak. I used bisection (which unfortunately took the whole evening), and the good news is that I've isolated the commits that actually broke it. See GitHub issue #398 for details, diffs, etc.
Unfortunately, it's 12 commits from Mark, and the individual commits raise an exception on the segfaulting code, so I can't pinpoint the problem further.
In general, how can I debug this sort of problem? I tried to use valgrind with a debugging build of numpy, but it reports tons of false (?) positives: https://gist.github.com/3549063
Mark, by looking at the changes that broke it, as well as at my "fix", do you see where the problem could be?
I suspect it is something in the changes to PyArray_FromAny() or PyArray_FromArray() in ctors.c, but so far I don't see anything that could cause it.
Thanks for any help. This is one of the issues blocking the 1.7.0 release.
Ondrej
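A rough way to check whether a patch like this leaks, without valgrind, is to watch an object's reference count across many calls of the suspect routine. The sketch below is only an illustration under assumptions: `refcount_delta`, `leaky` and `balanced` are hypothetical names, and `leaky` is a pure-Python stand-in for a C function that is missing a Py_DECREF. Only `sys.getrefcount` is a real CPython call.

```python
import sys

def refcount_delta(func, arg, repeat=1000):
    """Return the change in sys.getrefcount(arg) across `repeat` calls of func(arg)."""
    before = sys.getrefcount(arg)
    for _ in range(repeat):
        func(arg)
    return sys.getrefcount(arg) - before

_kept = []  # simulates references that are taken but never released

def leaky(obj):
    # Hypothetical stand-in for a C routine that is missing a Py_DECREF.
    _kept.append(obj)

def balanced(obj):
    # Hypothetical stand-in for a routine that releases everything it takes.
    tmp = [obj]
    del tmp

probe = object()
print("balanced:", refcount_delta(balanced, probe))
print("leaky:   ", refcount_delta(leaky, probe))
```

On CPython a balanced routine should leave the count unchanged, while a leaking one grows it in proportion to `repeat`. A count that drops instead points at an extra DECREF, which is the kind of bug that eventually segfaults.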
Hi,
re: valgrind - to get better results you might try the suggestions from: http://svn.python.org/projects/python/trunk/Misc/README.valgrind
Richard
On 31 August 2012 09:03, Ondřej Čertík <ondrej.certik@gmail.com> wrote:
In general, how can I debug this sort of problem? I tried to use valgrind, with a debugging build of numpy, but it provides tons of false (?) positives: https://gist.github.com/3549063
Ondrej
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
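A sketch of a valgrind run that bypasses Python's small-object allocator without rebuilding the interpreter. The assumptions are loud ones. It relies on CPython 3.6 or later, which postdates this thread and honours the `PYTHONMALLOC=malloc` environment variable. The suppressions path assumes a CPython source checkout that ships `Misc/valgrind-python.supp`. The names `valgrind_invocation` and `reproduce_segfault.py` are hypothetical. The code only builds the command and does not run it.

```python
import os
import sys

def valgrind_invocation(script, suppressions="Misc/valgrind-python.supp"):
    """Return (argv, env) for running `script` under valgrind with pymalloc bypassed."""
    env = dict(os.environ)
    # Assumption: CPython 3.6+, where PYTHONMALLOC=malloc makes the interpreter
    # use the C library allocator instead of pymalloc.
    env["PYTHONMALLOC"] = "malloc"
    argv = [
        "valgrind",
        "--suppressions=" + suppressions,  # relative to a CPython source checkout
        sys.executable or "python",
        script,
    ]
    return argv, env

# "reproduce_segfault.py" is a hypothetical script that triggers the crash.
argv, env = valgrind_invocation("reproduce_segfault.py")
print(" ".join(argv))
```

To actually run it, pass the pair to `subprocess.run(argv, env=env)` on a machine where valgrind is installed.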
On 08/31/2012 09:03 AM, Ondřej Čertík wrote:
In general, how can I debug this sort of problem? I tried to use valgrind, with a debugging build of numpy, but it provides tons of false (?) positives: https://gist.github.com/3549063
IIRC you can recompile Python with some support for detecting memory leaks. One of the issues with using Valgrind, even after suppressing the false positives, is that Python uses its own memory allocator, which sits between the bug and what Valgrind detects. So at least recompile Python to not do that.
As for hardening the NumPy source in general, you should at least be aware of these two options:
1) David Malcolm (dmalcolm@redhat.com) was writing a static code analysis plugin for gcc that would check, for every routine, that the reference count semantics are correct. (I don't know how far he's got with that.)
2) In Cython we have a "reference count nanny". This requires changes to all the code though, so it's not an option just for finding this bug; I just thought I'd mention it. In addition to the INCREF/DECREF calls you need to insert new "GIVEREF" and "GOTREF" calls (which are no-ops in a normal compile) to declare where you get and give away a reference. When Cython-generated sources are compiled with -DCYTHON_REFNANNY, INCREF/DECREF/GIVEREF/GOTREF are tracked within each function and a failure is raised if the function violates any contract.
Dag
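The bookkeeping behind option 2 can be pictured with a toy model. This is not Cython's refnanny implementation, only a sketch of the idea that each function keeps a ledger of the references it owns and must end with a balanced one. `ToyRefNanny`, `RefContractError` and the two demo functions are hypothetical names invented for the illustration.

```python
from collections import Counter

class RefContractError(Exception):
    """Raised when a function's reference bookkeeping does not balance."""

class ToyRefNanny:
    """Toy per-function ledger; not Cython's actual refnanny implementation."""

    def __init__(self, func_name):
        self.func_name = func_name
        self.owned = Counter()  # object label -> references this function owns

    def _acquire(self, label):
        self.owned[label] += 1

    def _release(self, label, action):
        if self.owned[label] <= 0:
            raise RefContractError(
                f"{self.func_name}: {action} on '{label}' without an owned reference")
        self.owned[label] -= 1

    def incref(self, label):   # the function created a new reference itself
        self._acquire(label)

    def gotref(self, label):   # the function received a reference from a call
        self._acquire(label)

    def decref(self, label):   # the function dropped a reference
        self._release(label, "DECREF")

    def giveref(self, label):  # the function handed a reference to someone else
        self._release(label, "GIVEREF")

    def finish(self):
        # At function exit nothing may remain owned, otherwise it leaked.
        leaked = {k: v for k, v in self.owned.items() if v > 0}
        if leaked:
            raise RefContractError(f"{self.func_name}: leaked {leaked}")

def returns_new_reference(nanny):
    nanny.gotref("result")   # e.g. the return value of an allocating call
    nanny.giveref("result")  # returned to the caller, who now owns it
    nanny.finish()

def double_release(nanny):
    nanny.gotref("arr")
    nanny.decref("arr")
    nanny.decref("arr")      # one DECREF too many

returns_new_reference(ToyRefNanny("returns_new_reference"))
try:
    double_release(ToyRefNanny("double_release"))
except RefContractError as exc:
    print(exc)
```

The point of the design is that the error is reported in the function that broke the contract, rather than much later when the interpreter touches freed memory.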
Hi Dag,
On Fri, Aug 31, 2012 at 4:22 AM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
IIRC you can recompile Python with some support for detecting memory leaks. One of the issues with using Valgrind, after suppressing the false positives, is that Python uses its own memory allocator so that sits between the bug and what Valgrind detects. So at least recompile Python to not do that.
Right. Compiling with "--without-pymalloc" (per README.valgrind as suggested above by Richard) should improve things a lot. Thanks for the tip.
As for hardening the NumPy source in general, you should at least be aware of these two options:
1) David Malcolm (dmalcolm@redhat.com) was writing a static code analysis plugin for gcc that would check, for every routine, that the reference count semantics are correct. (I don't know how far he's got with that.)
2) In Cython we have a "reference count nanny". This requires changes to all the code though, so it's not an option just for finding this bug; I just thought I'd mention it. In addition to the INCREF/DECREF calls you need to insert new "GIVEREF" and "GOTREF" calls (which are no-ops in a normal compile) to declare where you get and give away a reference. When Cython-generated sources are compiled with -DCYTHON_REFNANNY, INCREF/DECREF/GIVEREF/GOTREF are tracked within each function and a failure is raised if the function violates any contract.
I see. That's a nice option. For my own code, I never touch the reference counting by hand and rather just use Cython.
In the meantime, Mark fixed it:
https://github.com/numpy/numpy/pull/400
https://github.com/numpy/numpy/pull/405
Mark, thanks again for this. That saved me a lot of time.
Ondrej
On Fri, Aug 31, 2012 at 5:35 PM, Ondřej Čertík <ondrej.certik@gmail.com> wrote:
In the meantime, Mark fixed it:
https://github.com/numpy/numpy/pull/400 https://github.com/numpy/numpy/pull/405
Mark, thanks again for this. That saved me a lot of time.
No problem. The way I prefer to deal with this kind of error is to use C++ smart pointers. C++11's unique_ptr and Boost's intrusive_ptr are both useful for painlessly managing this kind of reference counting headache.
-Mark
On Fri, Aug 31, 2012 at 5:56 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
No problem. The way I prefer to deal with this kind of error is to use C++ smart pointers. C++11's unique_ptr and Boost's intrusive_ptr are both useful for painlessly managing this kind of reference counting headache.
Oh yes. I prefer to use Trilinos' RCP, which is a shared pointer (just like in C++11) but has better debugging info if something goes wrong. It can be compiled in two modes: one is slower but can't segfault; the other is optimized, with most operations at native raw pointer speed, but it can segfault.
Ondrej
participants (4)
- Dag Sverre Seljebotn
- Mark Wiebe
- Ondřej Čertík
- Richard Hattersley