[Python-Dev] python2.1.1 SEGV in GC on Solaris 2.7

Thu, 18 Oct 2001 07:38:15 -0700

[I'm moving the discussion here from SF; using the tracker is too
painful.]

Anthony Baxter:
> I've got a Zope installation where python2.1.1 is
> segfaulting on Solaris2.7 - it's running a largish 
> ZEO server.
[..]
> Here's the trace with debugging enabled:
> 
> #0  0xff00 in ?? ()
> #1  0x402f0 in collect (young=0x9b538, old=0x9b544) at
> ./Modules/gcmodule.c:379
> #2  0x405a8 in collect_generations () at
> ./Modules/gcmodule.c:484
> #3  0x40624 in _PyGC_Insert (op=0xbc1f24) at
> ./Modules/gcmodule.c:507
> #4  0x5a224 in PyList_New (size=0) at Objects/listobject.c:61
> #5  0x21bc8 in eval_code2 (co=0x1cb370, globals=0x21bc0,
> locals=0x67,
>     args=0x0, argcount=1, kws=0xf89b24, kwcount=0, defs=0x0,
> defcount=0,
>     closure=0xbc1f24) at Python/ceval.c:1741
> 
> Next trick is to rebuild without any optimisation (sigh)
> as I suspect that it's inlined subtract_refs().

Martin v. Löwis:
> It would be interesting to know what the value of "gc" is
> at the time of the crash. It looks like you got an object that
> claims to support GC but has a null tp_traverse.

Anthony Baxter:
> Ok, I have an intact core file, and a matching binary,
> no optimisations, nothing. This one shows the crash at
> line 166 of gcmodule.c:
>  traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
> PyObject_FROM_GC(gc)->ob_type in this case is
> 
> $24 = {ob_refcnt = 1, ob_type = 0x0}
> 
> To check my logic, I checked gc_next and gc_prev using 
> the same GDB magic, and they correctly show up as a tuple
> and an instance method. 
> 
> Some fiddling around seems to rule out stack space as the
> problem, as well. We're going to try and see if purify 
> helps here, but the problem looks to be a junk object - 
> I have no idea how to track this down further. Help?
> Would taking the horrible horrible hack of removing the
> object from the gc linked list if ob_type is null help?
> Well, it'd stop the crashes, anyway.

Martin v. Löwis:
> There are two options:
> 
> a) the object isn't really a GC object, i.e. has no GC
> header. In gdb, you can try to cast gc to PyObject*, then
> see if the resulting pointer has a better ob_type (this is
> unlikely, though, since the logic entering the object was
> already using fromgc/togc)
> 
> b) somebody has cleared the ob_type field.
> 
> Can you guarantee that all extension modules have been
> compiled with the 2.1.1 header files?
> 
> Is the problem repeatable in the sense that gc will have the
> same pointer value on each crash? If so, it is relatively
> easy to track down: just set a gdb watchpoint on the
> address of the ob_type field of that object (note that
> setting watchpoints is not possible until there is really
> mapped memory at that address).
> 
> If you can't analyse it through watchpoints, I
> recommend annotating the interpreter in the following way:
> in pyobject_init, put a printf that prints the address and
> the tp_name of the type. In subtract_refs, if the ob_type
> slot is null, print the address of the object and abort.
> Then analyse the log to see whether an object really has been
> allocated at that address, and what its type was (make sure
> you consider the possibility that addresses are off by the
> delta that FROM_GC adds).

Anthony Baxter:
> It's not a GC object. I'm positive all the extension 
> objects are correct - I just recompiled, without the
> 1.5/2.0 headers around.
> It's a different pointer each time round, unfortunately. It 
> also takes anything from 5 minutes to 2 hours to reproduce.
> I've got about 4 copies of it running now, and I've got a
> bunch of different core files. I've grabbed purify and an
> eval license, and I'm feeding it the binary. 
> 
> The printf approach is probably not going to work - these
> are busy busy Zope servers. Instead, my plan, assuming that
> purify doesn't immediately spot a problem, is to change the
> code so that if it gets a dud GC object, it will just bust
> it out of the tree and let it leak, and print a message 
> saying so. Then I can quit the program, and purify will
> tell me 'hey, you leaked!' and also tell me where it was
> allocated. 
> 
> More concerning, about half the segfaults are not from the
> GC at all, but from realloc in PyFrame_New (line 161 of
> frameobject). These are the only two I'm getting - it's 
> split 50-50 amongst the 10 coredumps I have now. I'm not
> sure whether to open a separate bug for this. 
> 
> Has python2.1.1 been purified? With Zope and zope's 
> extensions?
> 
> 
> Wow - it's amazing how this SF bug thing is so painful for
> conversations :)

The ob_type pointer must be getting cleared after the object has been
added to the GC lists.  The PyObject_IS_GC call in _PyGC_Insert would
have segfaulted otherwise.  Knowing the type of the object would be
helpful in debugging the problem.  I suggest reconsidering Martin's
printf idea.  You could add something like this to _PyGC_Insert:

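    void
    _PyGC_Insert(PyObject *op)
    {
        /* Open the log once, on the first call. */
        static int did_open = 0;
        static FILE *log;
        if (!did_open) {
            did_open = 1;
            log = fopen("type.log", "w");
        }
        if (log != NULL) {
            /* Record the object address and its type pointer. */
            fprintf(log, "%p %p\n", (void *)op, (void *)op->ob_type);
            fflush(log);  /* don't lose buffered lines in a segfault */
        }
    ...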

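Martin's other suggestion, checking for a null ob_type while walking
the list, gives you the address to look up in that log.  Here is a
minimal standalone sketch of such a check.  It uses toy structs, not
the real ones from the Python headers: toy_type, toy_object,
toy_gc_head, TOY_FROM_GC, toy_new and find_dud are all made-up names,
and I'm assuming the object sits directly after its GC header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only, NOT the real CPython structs. */
typedef struct toy_type { const char *tp_name; } toy_type;
typedef struct toy_object { int ob_refcnt; toy_type *ob_type; } toy_object;
typedef struct toy_gc_head {
    struct toy_gc_head *gc_next;
    struct toy_gc_head *gc_prev;
    int gc_refs;
} toy_gc_head;

/* Assumption: the object starts right after its GC header. */
#define TOY_FROM_GC(g) ((toy_object *)(((toy_gc_head *)(g)) + 1))

/* Allocate header + object in one block and append the header to the
   circular list headed by `list`. */
toy_object *
toy_new(toy_gc_head *list, toy_type *type)
{
    toy_gc_head *gc = malloc(sizeof(toy_gc_head) + sizeof(toy_object));
    toy_object *op;
    assert(gc != NULL);
    op = TOY_FROM_GC(gc);
    op->ob_refcnt = 1;
    op->ob_type = type;
    gc->gc_refs = 0;
    gc->gc_next = list;
    gc->gc_prev = list->gc_prev;
    list->gc_prev->gc_next = gc;
    list->gc_prev = gc;
    return op;
}

/* Walk the list and return the first header whose object has a null
   ob_type, reporting both addresses; return NULL if all look sane. */
toy_gc_head *
find_dud(toy_gc_head *list)
{
    toy_gc_head *gc;
    for (gc = list->gc_next; gc != list; gc = gc->gc_next) {
        toy_object *op = TOY_FROM_GC(gc);
        if (op->ob_type == NULL) {
            fprintf(stderr, "dud GC object: gc=%p op=%p\n",
                    (void *)gc, (void *)op);
            return gc;
        }
    }
    return NULL;
}
```

In the interpreter itself you would probably abort() at that point
rather than return, so that the core file has the address ready.
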
Debugging this type of problem is really hard (as you already know)
because the effect of the bug is found so far away from the source.
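
To narrow that distance a little, here is a rough sketch of a lookup
over the log: given an address taken from a core file, it reports the
last type pointer logged for it.  find_last_type is a hypothetical
helper, and I'm assuming the two-column hex format that the fprintf
above writes (with or without a leading 0x).  Keep in mind Martin's
point that the gc pointer at the crash is off from the logged op by
the FROM_GC delta.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: scan a log of "<address> <type>" hex pairs and
   store in *type_out the type recorded by the LAST line whose address
   equals `addr`.  Returns 1 if the address was found, 0 otherwise. */
int
find_last_type(FILE *log, unsigned long long addr,
               unsigned long long *type_out)
{
    char a[64], t[64];
    int found = 0;
    assert(log != NULL && type_out != NULL);
    while (fscanf(log, "%63s %63s", a, t) == 2) {
        /* Base 16 accepts an optional "0x" prefix. */
        if (strtoull(a, NULL, 16) == addr) {
            *type_out = strtoull(t, NULL, 16);
            found = 1;
        }
    }
    return found;
}
```

The last match is the one that matters, since an address can be
reused for a different object once the earlier one has been freed.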

  Neil