[Python-Dev] python2.1.1 SEGV in GC on Solaris 2.7
Neil Schemenauer
nas@python.ca
Thu, 18 Oct 2001 07:38:15 -0700
[I'm moving the discussion here from SF; using the tracker is too
painful.]
Anthony Baxter:
> I've got a Zope installation where python2.1.1 is
> segfaulting on Solaris2.7 - it's running a largish
> ZEO server.
[..]
> Here's the trace with debugging enabled:
>
> #0 0xff00 in ?? ()
> #1 0x402f0 in collect (young=0x9b538, old=0x9b544) at
> ./Modules/gcmodule.c:379
> #2 0x405a8 in collect_generations () at
> ./Modules/gcmodule.c:484
> #3 0x40624 in _PyGC_Insert (op=0xbc1f24) at
> ./Modules/gcmodule.c:507
> #4 0x5a224 in PyList_New (size=0) at Objects/listobject.c:61
> #5 0x21bc8 in eval_code2 (co=0x1cb370, globals=0x21bc0,
> locals=0x67,
> args=0x0, argcount=1, kws=0xf89b24, kwcount=0, defs=0x0,
> defcount=0,
> closure=0xbc1f24) at Python/ceval.c:1741
>
> Next trick is to rebuild without any optimisation (sigh)
> as I suspect that it's inlined subtract_refs().
Martin v. Löwis:
> It would be interesting to know what the value of "gc" is at the
> time of the crash. It looks like you got an object that
> claims to support GC but has a null tp_traverse.
Anthony Baxter:
> Ok, I have an intact core file, and a matching binary,
> no optimisations, nothing. The core file shows the
> crash at line 166 of gcmodule.c:
> traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
> PyObject_FROM_GC(gc)->ob_type in this case is
>
> $24 = {ob_refcnt = 1, ob_type = 0x0}
>
> To check my logic, I checked gc_next and gc_prev using
> the same GDB magic, and they correctly show up as a tuple
> and an instance method.
>
> Some fiddling around seems to rule out stack space as the
> problem, as well. We're going to try and see if purify
> helps here, but the problem looks to be a junk object -
> I have no idea how to track this down further. Help?
> Would taking the horrible horrible hack of removing the
> object from the gc linked list if ob_type is null help?
> Well, it'd stop the crashes, anyway.
Martin v. Löwis:
> There are two options:
>
> a) the object isn't really a GC object, i.e. has no GC
> header. In gdb, you can try to cast gc to PyObject*, then
> see if the resulting pointer has a better ob_type (this is
> unlikely, though, since the logic entering the object was
> already using fromgc/togc)
>
> b) somebody has cleared the ob_type field.
>
> Can you guarantee that all extension modules have been
> compiled with the 2.1.1 header files?
>
> Is the problem repeatable in the sense that gc will have the
> same pointer value on each crash? If so, it is relatively
> easy to track down: just set a gdb watchpoint on the
> address of that object's ob_type field (note that
> setting watchpoints is not possible until there is really
> mapped memory at that address).
>
> If you can't analyse it through watchpoints, I
> recommend annotating the interpreter in the following way:
> in pyobject_init, put a printf that prints the address and
> the tp_name of the type. In subtract_refs, if the ob_type
> slot is null, print the address of the object and abort.
> Then analyse the log to see whether an object really has been
> allocated at that address, and what its type was (make sure
> you consider the possibility that addresses are off by the
> delta that FROM_GC adds).
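To make the "delta that FROM_GC adds" concrete, here is a small
standalone sketch. It is not the real gcmodule.c code: the struct
layouts and the names toy_type, toy_object, toy_gc_head, TOY_AS_GC,
TOY_FROM_GC, toy_gc_new and toy_gc_delta are all made up for
illustration. The only assumption carried over from the thread is that
the GC header sits directly in front of the object, so an address
printed at allocation time and an address seen inside the collector can
differ by the size of that header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins, NOT the real Python 2.1 structures. */
typedef struct toy_type { const char *tp_name; } toy_type;

typedef struct toy_object {
    long ob_refcnt;
    toy_type *ob_type;
} toy_object;

typedef struct toy_gc_head {
    struct toy_gc_head *gc_next;
    struct toy_gc_head *gc_prev;
    long gc_refs;
} toy_gc_head;

/* The header is stored directly before the object it describes. */
#define TOY_AS_GC(op)   ((toy_gc_head *)(op) - 1)
#define TOY_FROM_GC(gc) ((toy_object *)((toy_gc_head *)(gc) + 1))

/* Allocate header and object in one block; return the object pointer. */
toy_object *toy_gc_new(toy_type *tp)
{
    toy_gc_head *gc = malloc(sizeof(toy_gc_head) + sizeof(toy_object));
    toy_object *op;

    if (gc == NULL)
        return NULL;
    gc->gc_next = gc->gc_prev = gc;
    gc->gc_refs = 0;
    op = TOY_FROM_GC(gc);
    op->ob_refcnt = 1;
    op->ob_type = tp;
    return op;
}

/* Byte distance between the header address and the object address. */
ptrdiff_t toy_gc_delta(toy_object *op)
{
    return (char *)op - (char *)TOY_AS_GC(op);
}
```

In this toy layout, a log written with object addresses has to be
searched for "collector address plus toy_gc_delta(op)", not for the
collector address itself.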
Anthony Baxter:
> It's not a GC object. I'm positive all the extension
> objects are correct - I just recompiled, without the
> 1.5/2.0 headers around.
> It's a different pointer each time round, unfortunately. It
> also takes anything from 5 minutes to 2 hours to reproduce.
> I've got about 4 copies of it running now, and I've got a
> bunch of different core files. I've grabbed purify and an
> eval license, and I'm feeding it the binary.
>
> The printf approach is probably not going to work - these
> are busy busy Zope servers. Instead, my plan, assuming that
> purify doesn't immediately spot a problem, is to change the
> code so that if it gets a dud GC object, it will just bust
> it out of the tree and let it leak, and print a message
> saying so. Then I can quit the program, and purify will
> tell me 'hey, you leaked!' and also tell me where it was
> allocated.
>
> More concerning, about half the segfaults are not from the
> GC at all, but from realloc in PyFrame_New (line 161 of
> frameobject). These are the only two I'm getting - it's
> split 50-50 amongst the 10 coredumps I have now. I'm not
> sure whether to open a separate bug for this.
>
> Has python2.1.1 been purified? With Zope and zope's
> extensions?
>
>
> Wow - it's amazing how this SF bug thing is so painful for
> conversations :)
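Anthony's plan (drop a dud object from the list, let it leak, and print
a message) might look roughly like the sketch below. This is a
standalone toy, not a patch against gcmodule.c: node, list_init,
list_append, list_unlink and skip_duds are hypothetical names, and the
"type" field merely stands in for ob_type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical node in a circular doubly linked list with a sentinel
 * head. "type" stands in for ob_type; NULL marks a corrupted entry. */
typedef struct node {
    struct node *next;
    struct node *prev;
    const char *type;
} node;

void list_init(node *head)
{
    head->next = head->prev = head;
    head->type = "head";
}

void list_append(node *head, node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

void list_unlink(node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}

/* Walk the list, unlink every node whose type is NULL, and report it.
 * The node is deliberately not freed, so that a leak checker can later
 * show where it was allocated. Returns the number of nodes removed. */
int skip_duds(node *head, FILE *report)
{
    int removed = 0;
    node *n = head->next;

    while (n != head) {
        node *next = n->next;   /* save the successor before unlinking */
        if (n->type == NULL) {
            fprintf(report, "dud object at %p removed from list\n",
                    (void *)n);
            list_unlink(n);
            removed++;
        }
        n = next;
    }
    return removed;
}
```

This only hides the symptom: whatever cleared the field is still
running, which fits with the realloc crashes showing up elsewhere.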
The ob_type pointer must be getting cleared after the object has been
added to the GC lists. The PyObject_IS_GC call in _PyGC_Insert would
have segfaulted otherwise. Knowing the type of the object would be
helpful in debugging the problem. I suggest reconsidering Martin's
printf idea. You could add something like this to _PyGC_Insert:
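    void
    _PyGC_Insert(PyObject *op)
    {
        static int did_open = 0;
        static FILE *log;

        if (!did_open) {
            did_open = 1;
            log = fopen("type.log", "w");
        }
        if (log != NULL) {
            /* record the object's address and its type pointer */
            fprintf(log, "%p %p\n", (void *)op, (void *)op->ob_type);
            /* flush so the entry is not lost if the process crashes */
            fflush(log);
        }
        ...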
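Once type.log exists, the address from the core file has to be matched
against it. A small helper along the following lines could do that. It
is only a sketch: find_last_type is a hypothetical name, the log is
assumed to hold the two %p fields written by the fprintf above (read
back with %p on the same platform), and delta is the header size to
allow for, as Martin noted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Scan a log of "address type" pairs and look for target, or for
 * target + delta in case target is a GC header address and the log
 * holds object addresses. Addresses get reused after frees, so the
 * LAST matching line is the most recent allocation.
 * Returns 1 and stores the type pointer in *type_out on a match,
 * returns 0 otherwise. */
int find_last_type(FILE *log, void *target, size_t delta, void **type_out)
{
    void *addr;
    void *type;
    void *shifted = (char *)target + delta;
    int found = 0;

    while (fscanf(log, "%p %p", &addr, &type) == 2) {
        if (addr == target || addr == shifted) {
            *type_out = type;
            found = 1;
        }
    }
    return found;
}
```

The type pointer it returns can then be inspected in gdb against the
matching binary to read its tp_name.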
Debugging this type of problem is really hard (as you already know)
because the effect of the bug is found so far away from the source.
Neil