[Python-bugs-list] [ python-Bugs-471942 ] python2.1.1 SEGV in GC on Solaris 2.7

Sat, 22 Dec 2001 19:09:52 -0800

Bugs item #471942, was opened at 2001-10-16 19:56
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470

Category: Python Interpreter Core
Group: Python 2.1.1
Status: Open
Resolution: None
Priority: 5
Submitted By: Anthony Baxter (anthonybaxter)
Assigned to: Anthony Baxter (anthonybaxter)
Summary: python2.1.1 SEGV in GC on Solaris 2.7

Initial Comment:
I've got a Zope installation where python2.1.1 is
segfaulting on Solaris2.7 - it's running a largish 
ZEO server. The tail of the gdb output is:

 #128 0x26164 in PyEval_CallObjectWithKeywords ()
 #129 0x264c0 in PyEval_CallObjectWithKeywords ()
 #130 0x26140 in PyEval_CallObjectWithKeywords ()
 #131 0x25fc0 in PyEval_CallObjectWithKeywords ()
 #132 0x517bc in PyInstance_New ()
 #133 0x261a4 in PyEval_CallObjectWithKeywords ()
 #134 0x25fc0 in PyEval_CallObjectWithKeywords ()
 #135 0x42c90 in initgc ()

It's built with 
<anthony@devhost1>$ gcc -v
Reading specs from
/opt/local/lib/gcc-lib/sparc-sun-solaris2.7/2.95.2/specs
gcc version 2.95.2 19991024 (release)
which is a bit old. 

I'm going to rebuild with gcc3.0 and also try turning 
off the GC. Unfortunately I can't get this to happen
on a smaller test system - it's only under load that 
it plows into the ground. 

I'll also leave symbols in this time... :/

----------------------------------------------------------------------

>Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-12-22 19:09

Message:
Logged In: YES 
user_id=29957

It seems very likely that the recent compiler stack
calculation fixes are responsible for this - I plan to 
look into this when I get back into work on the 3rd. 

----------------------------------------------------------------------

Comment By: Joseph Wayne Norton (natsukashi)
Date: 2001-12-16 16:31

Message:
Logged In: YES 
user_id=358486

No, I haven't tried the ceval.c patch.  Do know of any
documentation regarding this bug and the bug fix? I'm just
curious before applying this patch.

There also is an update from the folks at ZC.  They located
a source of trouble with respect to python 2.1 and the
restricted dtml engine.  See the zope-dev mailing list for
futher details.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-14 02:46

Message:
Logged In: YES 
user_id=21627

natsukashi, have you applied 2.277 of ceval.c?

----------------------------------------------------------------------

Comment By: Joseph Wayne Norton (natsukashi)
Date: 2001-12-13 16:41

Message:
Logged In: YES 
user_id=358486

Here is an excerpt from one debugging session ...

> (gdb) info threads 
>   17 Thread 10          0xef5b9810 in _lwp_sema_wait () 
>   16 Thread 9          0xef647cac in _swtch () 
>   15 Thread 8          0xef5b9810 in _lwp_sema_wait () 
>   14 Thread 7 (LWP 5)  0xcaeb50 in ?? () 
>   13 Thread 6          0xef647cac in _swtch () 
>   12 Thread 5          0xef5b9810 in _lwp_sema_wait () 
>   11 Thread 4          0xef647cac in _swtch () 
>   10 Thread 3          0xef647cac in _swtch () 
>   9 Thread 2 (LWP 2)  0xef5b9958 in _signotifywait () 
>   8 Thread 1 (LWP 6)  0xef5b7488 in _poll () 
>   7 LWP    8          0xef5b6a24 in door_restart () 
>   6 LWP    6          0xef5b7488 in _poll () 
>   5 LWP    5          0xcaeb50 in ?? () 
>   4 LWP    4          0xef5b9810 in _lwp_sema_wait () 
>   3 LWP    3          0xef5b9810 in _lwp_sema_wait () 
>   2 LWP    2          0xef5b9958 in _signotifywait () 
> * 1 LWP    1          0xef5b9810 in _lwp_sema_wait () 
> 
> (gdb) thread 14 
> [Switching to Thread 7 (LWP 5)] 
> #0  0xcaeb50 in ?? () 
> 
> (gdb) where 
> #0  0xcaeb50 in ?? () 
> #1  0x516bc in collect (young=0x13dec8, old=0x13ded4) 
>     at ./Modules/gcmodule.c:379 
> #2  0x51984 in collect_generations () at
./Modules/gcmodule.c:484 
> #3  0x519fc in _PyGC_Insert (op=0xecf7d4) at
./Modules/gcmodule.c:507 
> #4  0x664ec in PyMethod_New (func=0x3f796c, self=0x11c0d44, 
> class=0x3c7e5c) 
>     at Objects/classobject.c:1834 
> #5  0x63850 in instance_getattr2 (inst=0x11c0d44,
name=0x3d5378) 
>     at Objects/classobject.c:642 
> #6  0x63750 in instance_getattr1 (inst=0x11c0d44,
name=0x3d5378) 
>     at Objects/classobject.c:608 
> #7  0x63898 in instance_getattr (inst=0x11c0d44,
name=0x3d5378) 
>     at Objects/classobject.c:656 
> #8  0x78330 in PyObject_GetAttr (v=0x11c0d44, name=0x3d5378) 
>     at Objects/object.c:1052 
> #9  0x895ec in builtin_hasattr (self=0x0, args=0x12ed944) 
>     at Python/bltinmodule.c:886 
> #10 0x35a44 in call_cfunction (func=0x1609b0,
arg=0x12ed944, kw=0x0) 
>     at Python/ceval.c:2854 

Unfortunately, I do not have any specific information at the
moment.  There is also portions of a discussion that can be
viewed with the following url:
http://lists.zope.org/pipermail/zope-dev/2001-December/014541.html
Look for messages having the same subject.

Basically, we are running python 2.1.1 and zope 2.4.3 on the
solaris platform.  We have 3 servers each having the same
installed software except one server is solaris 5.6 and the
other two are solaris 5.7.

The zope process running on the solaris 5.6 machine is
restarting at least a couple times a day due to a SIGSEGV or
SIGILL signal.  I can generate a core file using the truss
command.  The problem appears to be related to garbage
collection.

Unfortunately, this behavior only occurs on our production
site and depends on the site load -- higher system load,
more frequent restarts.  We are not facing any restart
troubles on the solaris 5.7 machines or on our linux and
solaris development machines.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-13 12:12

Message:
Logged In: YES 
user_id=31435

Just noting that the function names reported in the 
traceback in the original writeup are bogus:  
PyEval_CallObjectWithKeywords() doesn't call itself.  Looks 
like gdb doesn't have enough info to report the true 
function name, so picks a function it does know about at an 
earlier address (or something <wink>).  I'm told this 
doesn't happen on Linux, but don't grasp why it might on 
Solaris.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-12-13 07:45

Message:
Logged In: YES 
user_id=21627

What do you mean with "the same trouble"? Merely that Python
crashes, or do you have some more specific information to
contribute?

----------------------------------------------------------------------

Comment By: Joseph Wayne Norton (natsukashi)
Date: 2001-12-11 18:37

Message:
Logged In: YES 
user_id=358486

Do you have any updates on this issue? 

I'm facing the same trouble on the following solaris platform:

SunOS i04sv2008 5.6 Generic_105181-25 sun4u sparc SUNW,Ultra-80

regards,

- j

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-12-04 23:38

Message:
Logged In: YES 
user_id=29957

Nah, this is a separate one to the frame bug. I'm still
seeing it on a regular regular basis, but haven't narrowed
it done to what's causing it. 

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2001-12-04 08:48

Message:
Logged In: YES 
user_id=31392

This was the bug that was tracked down to a problem in
try/except/finally, wasn't it?  If this is fixed, can you
close the bug report?  If not, can you provide an udpate on
its current status?

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-11-07 14:13

Message:
Logged In: YES 
user_id=21627

Any news on this report: did the problem disappear after
backporting changes?

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-10-18 04:32

Message:
Logged In: YES 
user_id=21627

Looking at the changes since 2.1.1, I can see one bugfix
that might explain strange behaviour, namely ceval.c 2.277.
If you want to try it out, you can get the change here

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Python/ceval.c.diff?r1=2.276&r2=2.277

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-18 03:59

Message:
Logged In: YES 
user_id=29957

What's the appropriate way to ask this? drop a line
to python-dev? purify's showing a UMR in strop_expandtabs
(from memory - don't have it on screen right now) when 
python starts up. 
Yeah, the realloc bug concerns me - I have to wonder if
something's a little bit/lot screwy. I just don't know 
where. 

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-10-18 03:46

Message:
Logged In: YES 
user_id=21627

This is something to ask the pythonlabs folks; I believe
Python has been purified once in a while - perhaps even with
Zope extensions.
I'd still suspect a bug in the Zope extensions instead of a
Python bug, though: if malloc crashes, chances are high that
somebody was overwriting free memory.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-18 03:31

Message:
Logged In: YES 
user_id=29957

It's not a GC object. I'm positive all the extension 
objects are correct - I just recompiled, without the
1.5/2.0 headers around.
It's a different pointer each time round, unfortunately. It 
also takes anything from 5 minutes to 2 hours to reproduce.
I've got about 4 copies of it running now, and I've got a
bunch of different core files. I've grabbed purify and an
eval license, and I'm feeding it the binary. 

The printf approach is probably not going to work - these
are busy busy Zope servers. Instead, my plan, assuming that
purify doesn't immediately spot a problem, is to change the
code so that if it gets a dud GC object, it will just bust
it out of the tree and let it leak, and print a message 
saying so. Then I can quit the program, and purify will
tell me 'hey, you leaked!' and also tell me where it was
allocated. 

More concerning, about half the segfaults are not from the
GC at all, but from realloc in PyFrame_New (line 161 of
frameobject). These are the only two I'm getting - it's 
split 50-50 amongst the 10 coredumps I have now. I'm not
sure whether to open a seperate bug for this. 

Has python2.1.1 been purified? With Zope and zope's 
extensions?

Wow - it's amazing how this SF bug thing is so painful for
conversations :)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-10-18 03:11

Message:
Logged In: YES 
user_id=21627

There are two options:

a) the object isn't really a GC object, i.e. has no GC
header. In gdb, you can try to cast gc to PyObject*, then
see if the resulting pointer has a better ob_type (this is
unlikely, though, since the logic entering the object was
already using fromgc/togc)

b) somebody has cleared the ob_type field.

Can you guarantee that all extension modules have been
compiled with the 2.1.1 header files?

Is the problem repeatable in the sense that gc will have the
same pointer value on each crash? If so, it is relatively
easy to track down: just set a gdb change watchpoint on the
address on the ob_type field of that address (note that
setting watchpoints is not possible until there is really
mapped memory on that address).

If you can't analyse it through change breakpoints, I
recommend to annotate the interpreter in the following way:
in pyobject_init, put a printf that prints the address and
the tp_name of the type. In subtract_refs, if the ob_type
slot is null, print the address of the object and abort.
Then analyse the log to see whether a object really has been
allocated on that address, and what its type was (make sure
you consider the possibility that address are off by the
delta that FROM_GC adds).

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-17 21:58

Message:
Logged In: YES 
user_id=29957

Ok, I have an intact core file, and a matching binary,
no optimisations, nothing. This crash is showing the
crash at line 166 of gcmodule.c
 traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
PyObject_FROM_GC(gc)->ob_type in this case is

$24 = {ob_refcnt = 1, ob_type = 0x0}

To check my logic, I checked gc_next and gc_prev using 
the same GDB magic, and they correctly show up as a tuple
and an instance method. 

Some fiddling around seems to rule out stack space as the
problem, as well. We're going to try and see if purify 
helps here, but the problem looks to be a junk object - 
I have no idea how to track this down further. Help?
Would taking the horrible horrible hack of removing the
object from the gc linked list if ob_type is null help?
Well, it'd stop the crashes, anyway.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-10-17 13:44

Message:
Logged In: YES 
user_id=21627

It would be interesting what the value of "gc" is at the 
time of the crash. It looks like you got an object that 
claims to support GC but has a null tp_traverse.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-17 06:08

Message:
Logged In: YES 
user_id=29957

I'm a doofus who read the gdb trace from the wrong
end - too much python lately :)
Nonetheless, the other end of the trace failed in 
gc as well - and building without GC enabled worked.

Here's the trace with debugging enabled:

#0  0xff00 in ?? ()
#1  0x402f0 in collect (young=0x9b538, old=0x9b544) at
./Modules/gcmodule.c:379
#2  0x405a8 in collect_generations () at
./Modules/gcmodule.c:484
#3  0x40624 in _PyGC_Insert (op=0xbc1f24) at
./Modules/gcmodule.c:507
#4  0x5a224 in PyList_New (size=0) at Objects/listobject.c:61
#5  0x21bc8 in eval_code2 (co=0x1cb370, globals=0x21bc0,
locals=0x67,
    args=0x0, argcount=1, kws=0xf89b24, kwcount=0, defs=0x0,
defcount=0,
    closure=0xbc1f24) at Python/ceval.c:1741

Next trick is to rebuild without any optimisation (sigh)
as I suspect that it's inlined subtract_refs().

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470