[Python-bugs-list] [ python-Bugs-471942 ] python2.1.1 SEGV in GC on Solaris 2.7

noreply@sourceforge.net noreply@sourceforge.net
Tue, 04 Dec 2001 23:38:00 -0800


Bugs item #471942, was opened at 2001-10-16 19:56
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470

Category: Python Interpreter Core
Group: Python 2.1.1
Status: Open
Resolution: None
Priority: 5
Submitted By: Anthony Baxter (anthonybaxter)
Assigned to: Anthony Baxter (anthonybaxter)
Summary: python2.1.1 SEGV in GC on Solaris 2.7

Initial Comment:
I've got a Zope installation where python2.1.1 is
segfaulting on Solaris2.7 - it's running a largish 
ZEO server. The tail of the gdb output is:

 #128 0x26164 in PyEval_CallObjectWithKeywords ()
 #129 0x264c0 in PyEval_CallObjectWithKeywords ()
 #130 0x26140 in PyEval_CallObjectWithKeywords ()
 #131 0x25fc0 in PyEval_CallObjectWithKeywords ()
 #132 0x517bc in PyInstance_New ()
 #133 0x261a4 in PyEval_CallObjectWithKeywords ()
 #134 0x25fc0 in PyEval_CallObjectWithKeywords ()
 #135 0x42c90 in initgc ()

It's built with 
<anthony@devhost1>$ gcc -v
Reading specs from
/opt/local/lib/gcc-lib/sparc-sun-solaris2.7/2.95.2/specs
gcc version 2.95.2 19991024 (release)
which is a bit old. 

I'm going to rebuild with gcc3.0 and also try turning 
off the GC. Unfortunately I can't get this to happen
on a smaller test system - it's only under load that 
it plows into the ground. 

I'll also leave symbols in this time... :/


----------------------------------------------------------------------

>Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-12-04 23:38

Message:
Logged In: YES 
user_id=29957

Nah, this is a separate one to the frame bug. I'm still
seeing it on a regular regular basis, but haven't narrowed
it done to what's causing it. 

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2001-12-04 08:48

Message:
Logged In: YES 
user_id=31392

This was the bug that was tracked down to a problem in
try/except/finally, wasn't it?  If this is fixed, can you
close the bug report?  If not, can you provide an udpate on
its current status?


----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2001-11-07 14:13

Message:
Logged In: YES 
user_id=21627

Any news on this report: did the problem disappear after
backporting changes?

----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2001-10-18 04:32

Message:
Logged In: YES 
user_id=21627

Looking at the changes since 2.1.1, I can see one bugfix
that might explain strange behaviour, namely ceval.c 2.277.
If you want to try it out, you can get the change here

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Python/ceval.c.diff?r1=2.276&r2=2.277

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-18 03:59

Message:
Logged In: YES 
user_id=29957

What's the appropriate way to ask this? drop a line
to python-dev? purify's showing a UMR in strop_expandtabs
(from memory - don't have it on screen right now) when 
python starts up. 
Yeah, the realloc bug concerns me - I have to wonder if
something's a little bit/lot screwy. I just don't know 
where. 

----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2001-10-18 03:46

Message:
Logged In: YES 
user_id=21627

This is something to ask the pythonlabs folks; I believe
Python has been purified once in a while - perhaps even with
Zope extensions.
I'd still suspect a bug in the Zope extensions instead of a
Python bug, though: if malloc crashes, chances are high that
somebody was overwriting free memory.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-18 03:31

Message:
Logged In: YES 
user_id=29957

It's not a GC object. I'm positive all the extension 
objects are correct - I just recompiled, without the
1.5/2.0 headers around.
It's a different pointer each time round, unfortunately. It 
also takes anything from 5 minutes to 2 hours to reproduce.
I've got about 4 copies of it running now, and I've got a
bunch of different core files. I've grabbed purify and an
eval license, and I'm feeding it the binary. 

The printf approach is probably not going to work - these
are busy busy Zope servers. Instead, my plan, assuming that
purify doesn't immediately spot a problem, is to change the
code so that if it gets a dud GC object, it will just bust
it out of the tree and let it leak, and print a message 
saying so. Then I can quit the program, and purify will
tell me 'hey, you leaked!' and also tell me where it was
allocated. 

More concerning, about half the segfaults are not from the
GC at all, but from realloc in PyFrame_New (line 161 of
frameobject). These are the only two I'm getting - it's 
split 50-50 amongst the 10 coredumps I have now. I'm not
sure whether to open a seperate bug for this. 

Has python2.1.1 been purified? With Zope and zope's 
extensions?


Wow - it's amazing how this SF bug thing is so painful for
conversations :)


----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2001-10-18 03:11

Message:
Logged In: YES 
user_id=21627

There are two options:

a) the object isn't really a GC object, i.e. has no GC
header. In gdb, you can try to cast gc to PyObject*, then
see if the resulting pointer has a better ob_type (this is
unlikely, though, since the logic entering the object was
already using fromgc/togc)

b) somebody has cleared the ob_type field.

Can you guarantee that all extension modules have been
compiled with the 2.1.1 header files?

Is the problem repeatable in the sense that gc will have the
same pointer value on each crash? If so, it is relatively
easy to track down: just set a gdb change watchpoint on the
address on the ob_type field of that address (note that
setting watchpoints is not possible until there is really
mapped memory on that address).

If you can't analyse it through change breakpoints, I
recommend to annotate the interpreter in the following way:
in pyobject_init, put a printf that prints the address and
the tp_name of the type. In subtract_refs, if the ob_type
slot is null, print the address of the object and abort.
Then analyse the log to see whether a object really has been
allocated on that address, and what its type was (make sure
you consider the possibility that address are off by the
delta that FROM_GC adds).

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-17 21:58

Message:
Logged In: YES 
user_id=29957

Ok, I have an intact core file, and a matching binary,
no optimisations, nothing. This crash is showing the
crash at line 166 of gcmodule.c
 traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
PyObject_FROM_GC(gc)->ob_type in this case is

$24 = {ob_refcnt = 1, ob_type = 0x0}

To check my logic, I checked gc_next and gc_prev using 
the same GDB magic, and they correctly show up as a tuple
and an instance method. 

Some fiddling around seems to rule out stack space as the
problem, as well. We're going to try and see if purify 
helps here, but the problem looks to be a junk object - 
I have no idea how to track this down further. Help?
Would taking the horrible horrible hack of removing the
object from the gc linked list if ob_type is null help?
Well, it'd stop the crashes, anyway.


----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2001-10-17 13:44

Message:
Logged In: YES 
user_id=21627

It would be interesting what the value of "gc" is at the 
time of the crash. It looks like you got an object that 
claims to support GC but has a null tp_traverse.


----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-17 06:08

Message:
Logged In: YES 
user_id=29957

I'm a doofus who read the gdb trace from the wrong
end - too much python lately :)
Nonetheless, the other end of the trace failed in 
gc as well - and building without GC enabled worked.

Here's the trace with debugging enabled:

#0  0xff00 in ?? ()
#1  0x402f0 in collect (young=0x9b538, old=0x9b544) at
./Modules/gcmodule.c:379
#2  0x405a8 in collect_generations () at
./Modules/gcmodule.c:484
#3  0x40624 in _PyGC_Insert (op=0xbc1f24) at
./Modules/gcmodule.c:507
#4  0x5a224 in PyList_New (size=0) at Objects/listobject.c:61
#5  0x21bc8 in eval_code2 (co=0x1cb370, globals=0x21bc0,
locals=0x67,
    args=0x0, argcount=1, kws=0xf89b24, kwcount=0, defs=0x0,
defcount=0,
    closure=0xbc1f24) at Python/ceval.c:1741

Next trick is to rebuild without any optimisation (sigh)
as I suspect that it's inlined subtract_refs().



----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470