[Python-bugs-list] [ python-Bugs-471942 ] python2.1.1 SEGV in GC on Solaris 2.7
noreply@sourceforge.net
noreply@sourceforge.net
Thu, 10 Jan 2002 06:37:24 -0800
Bugs item #471942, was opened at 2001-10-16 19:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470
Category: Python Interpreter Core
Group: Python 2.1.1
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Anthony Baxter (anthonybaxter)
Assigned to: Anthony Baxter (anthonybaxter)
Summary: python2.1.1 SEGV in GC on Solaris 2.7
Initial Comment:
I've got a Zope installation where python2.1.1 is
segfaulting on Solaris2.7 - it's running a largish
ZEO server. The tail of the gdb output is:
#128 0x26164 in PyEval_CallObjectWithKeywords ()
#129 0x264c0 in PyEval_CallObjectWithKeywords ()
#130 0x26140 in PyEval_CallObjectWithKeywords ()
#131 0x25fc0 in PyEval_CallObjectWithKeywords ()
#132 0x517bc in PyInstance_New ()
#133 0x261a4 in PyEval_CallObjectWithKeywords ()
#134 0x25fc0 in PyEval_CallObjectWithKeywords ()
#135 0x42c90 in initgc ()
It's built with
<anthony@devhost1>$ gcc -v
Reading specs from
/opt/local/lib/gcc-lib/sparc-sun-solaris2.7/2.95.2/specs
gcc version 2.95.2 19991024 (release)
which is a bit old.
I'm going to rebuild with gcc3.0 and also try turning
off the GC. Unfortunately I can't get this to happen
on a smaller test system - it's only under load that
it plows into the ground.
I'll also leave symbols in this time... :/
----------------------------------------------------------------------
>Comment By: Guido van Rossum (gvanrossum)
Date: 2002-01-10 06:37
Message:
Logged In: YES
user_id=6380
Yay!
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2002-01-10 03:17
Message:
Logged In: YES
user_id=29957
This is almost certainly fixed by the compiler fixes -
I can't get it to crash now. Closing.
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-12-22 19:09
Message:
Logged In: YES
user_id=29957
It seems very likely that the recent compiler stack
calculation fixes are responsible for this - I plan to
look into this when I get back into work on the 3rd.
----------------------------------------------------------------------
Comment By: Joseph Wayne Norton (natsukashi)
Date: 2001-12-16 16:31
Message:
Logged In: YES
user_id=358486
No, I haven't tried the ceval.c patch. Do know of any
documentation regarding this bug and the bug fix? I'm just
curious before applying this patch.
There also is an update from the folks at ZC. They located
a source of trouble with respect to python 2.1 and the
restricted dtml engine. See the zope-dev mailing list for
futher details.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-12-14 02:46
Message:
Logged In: YES
user_id=21627
natsukashi, have you applied 2.277 of ceval.c?
----------------------------------------------------------------------
Comment By: Joseph Wayne Norton (natsukashi)
Date: 2001-12-13 16:41
Message:
Logged In: YES
user_id=358486
Here is an excerpt from one debugging session ...
> (gdb) info threads
> 17 Thread 10 0xef5b9810 in _lwp_sema_wait ()
> 16 Thread 9 0xef647cac in _swtch ()
> 15 Thread 8 0xef5b9810 in _lwp_sema_wait ()
> 14 Thread 7 (LWP 5) 0xcaeb50 in ?? ()
> 13 Thread 6 0xef647cac in _swtch ()
> 12 Thread 5 0xef5b9810 in _lwp_sema_wait ()
> 11 Thread 4 0xef647cac in _swtch ()
> 10 Thread 3 0xef647cac in _swtch ()
> 9 Thread 2 (LWP 2) 0xef5b9958 in _signotifywait ()
> 8 Thread 1 (LWP 6) 0xef5b7488 in _poll ()
> 7 LWP 8 0xef5b6a24 in door_restart ()
> 6 LWP 6 0xef5b7488 in _poll ()
> 5 LWP 5 0xcaeb50 in ?? ()
> 4 LWP 4 0xef5b9810 in _lwp_sema_wait ()
> 3 LWP 3 0xef5b9810 in _lwp_sema_wait ()
> 2 LWP 2 0xef5b9958 in _signotifywait ()
> * 1 LWP 1 0xef5b9810 in _lwp_sema_wait ()
>
> (gdb) thread 14
> [Switching to Thread 7 (LWP 5)]
> #0 0xcaeb50 in ?? ()
>
> (gdb) where
> #0 0xcaeb50 in ?? ()
> #1 0x516bc in collect (young=0x13dec8, old=0x13ded4)
> at ./Modules/gcmodule.c:379
> #2 0x51984 in collect_generations () at
./Modules/gcmodule.c:484
> #3 0x519fc in _PyGC_Insert (op=0xecf7d4) at
./Modules/gcmodule.c:507
> #4 0x664ec in PyMethod_New (func=0x3f796c, self=0x11c0d44,
> class=0x3c7e5c)
> at Objects/classobject.c:1834
> #5 0x63850 in instance_getattr2 (inst=0x11c0d44,
name=0x3d5378)
> at Objects/classobject.c:642
> #6 0x63750 in instance_getattr1 (inst=0x11c0d44,
name=0x3d5378)
> at Objects/classobject.c:608
> #7 0x63898 in instance_getattr (inst=0x11c0d44,
name=0x3d5378)
> at Objects/classobject.c:656
> #8 0x78330 in PyObject_GetAttr (v=0x11c0d44, name=0x3d5378)
> at Objects/object.c:1052
> #9 0x895ec in builtin_hasattr (self=0x0, args=0x12ed944)
> at Python/bltinmodule.c:886
> #10 0x35a44 in call_cfunction (func=0x1609b0,
arg=0x12ed944, kw=0x0)
> at Python/ceval.c:2854
Unfortunately, I do not have any specific information at the
moment. There is also portions of a discussion that can be
viewed with the following url:
http://lists.zope.org/pipermail/zope-dev/2001-December/014541.html
Look for messages having the same subject.
Basically, we are running python 2.1.1 and zope 2.4.3 on the
solaris platform. We have 3 servers each having the same
installed software except one server is solaris 5.6 and the
other two are solaris 5.7.
The zope process running on the solaris 5.6 machine is
restarting at least a couple times a day due to a SIGSEGV or
SIGILL signal. I can generate a core file using the truss
command. The problem appears to be related to garbage
collection.
Unfortunately, this behavior only occurs on our production
site and depends on the site load -- higher system load,
more frequent restarts. We are not facing any restart
troubles on the solaris 5.7 machines or on our linux and
solaris development machines.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-12-13 12:12
Message:
Logged In: YES
user_id=31435
Just noting that the function names reported in the
traceback in the original writeup are bogus:
PyEval_CallObjectWithKeywords() doesn't call itself. Looks
like gdb doesn't have enough info to report the true
function name, so picks a function it does know about at an
earlier address (or something <wink>). I'm told this
doesn't happen on Linux, but don't grasp why it might on
Solaris.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-12-13 07:45
Message:
Logged In: YES
user_id=21627
What do you mean with "the same trouble"? Merely that Python
crashes, or do you have some more specific information to
contribute?
----------------------------------------------------------------------
Comment By: Joseph Wayne Norton (natsukashi)
Date: 2001-12-11 18:37
Message:
Logged In: YES
user_id=358486
Do you have any updates on this issue?
I'm facing the same trouble on the following solaris platform:
SunOS i04sv2008 5.6 Generic_105181-25 sun4u sparc SUNW,Ultra-80
regards,
- j
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-12-04 23:38
Message:
Logged In: YES
user_id=29957
Nah, this is a separate one to the frame bug. I'm still
seeing it on a regular regular basis, but haven't narrowed
it done to what's causing it.
----------------------------------------------------------------------
Comment By: Jeremy Hylton (jhylton)
Date: 2001-12-04 08:48
Message:
Logged In: YES
user_id=31392
This was the bug that was tracked down to a problem in
try/except/finally, wasn't it? If this is fixed, can you
close the bug report? If not, can you provide an udpate on
its current status?
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-11-07 14:13
Message:
Logged In: YES
user_id=21627
Any news on this report: did the problem disappear after
backporting changes?
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-10-18 04:32
Message:
Logged In: YES
user_id=21627
Looking at the changes since 2.1.1, I can see one bugfix
that might explain strange behaviour, namely ceval.c 2.277.
If you want to try it out, you can get the change here
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Python/ceval.c.diff?r1=2.276&r2=2.277
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-18 03:59
Message:
Logged In: YES
user_id=29957
What's the appropriate way to ask this? drop a line
to python-dev? purify's showing a UMR in strop_expandtabs
(from memory - don't have it on screen right now) when
python starts up.
Yeah, the realloc bug concerns me - I have to wonder if
something's a little bit/lot screwy. I just don't know
where.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-10-18 03:46
Message:
Logged In: YES
user_id=21627
This is something to ask the pythonlabs folks; I believe
Python has been purified once in a while - perhaps even with
Zope extensions.
I'd still suspect a bug in the Zope extensions instead of a
Python bug, though: if malloc crashes, chances are high that
somebody was overwriting free memory.
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-18 03:31
Message:
Logged In: YES
user_id=29957
It's not a GC object. I'm positive all the extension
objects are correct - I just recompiled, without the
1.5/2.0 headers around.
It's a different pointer each time round, unfortunately. It
also takes anything from 5 minutes to 2 hours to reproduce.
I've got about 4 copies of it running now, and I've got a
bunch of different core files. I've grabbed purify and an
eval license, and I'm feeding it the binary.
The printf approach is probably not going to work - these
are busy busy Zope servers. Instead, my plan, assuming that
purify doesn't immediately spot a problem, is to change the
code so that if it gets a dud GC object, it will just bust
it out of the tree and let it leak, and print a message
saying so. Then I can quit the program, and purify will
tell me 'hey, you leaked!' and also tell me where it was
allocated.
More concerning, about half the segfaults are not from the
GC at all, but from realloc in PyFrame_New (line 161 of
frameobject). These are the only two I'm getting - it's
split 50-50 amongst the 10 coredumps I have now. I'm not
sure whether to open a seperate bug for this.
Has python2.1.1 been purified? With Zope and zope's
extensions?
Wow - it's amazing how this SF bug thing is so painful for
conversations :)
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-10-18 03:11
Message:
Logged In: YES
user_id=21627
There are two options:
a) the object isn't really a GC object, i.e. has no GC
header. In gdb, you can try to cast gc to PyObject*, then
see if the resulting pointer has a better ob_type (this is
unlikely, though, since the logic entering the object was
already using fromgc/togc)
b) somebody has cleared the ob_type field.
Can you guarantee that all extension modules have been
compiled with the 2.1.1 header files?
Is the problem repeatable in the sense that gc will have the
same pointer value on each crash? If so, it is relatively
easy to track down: just set a gdb change watchpoint on the
address on the ob_type field of that address (note that
setting watchpoints is not possible until there is really
mapped memory on that address).
If you can't analyse it through change breakpoints, I
recommend to annotate the interpreter in the following way:
in pyobject_init, put a printf that prints the address and
the tp_name of the type. In subtract_refs, if the ob_type
slot is null, print the address of the object and abort.
Then analyse the log to see whether a object really has been
allocated on that address, and what its type was (make sure
you consider the possibility that address are off by the
delta that FROM_GC adds).
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-17 21:58
Message:
Logged In: YES
user_id=29957
Ok, I have an intact core file, and a matching binary,
no optimisations, nothing. This crash is showing the
crash at line 166 of gcmodule.c
traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
PyObject_FROM_GC(gc)->ob_type in this case is
$24 = {ob_refcnt = 1, ob_type = 0x0}
To check my logic, I checked gc_next and gc_prev using
the same GDB magic, and they correctly show up as a tuple
and an instance method.
Some fiddling around seems to rule out stack space as the
problem, as well. We're going to try and see if purify
helps here, but the problem looks to be a junk object -
I have no idea how to track this down further. Help?
Would taking the horrible horrible hack of removing the
object from the gc linked list if ob_type is null help?
Well, it'd stop the crashes, anyway.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2001-10-17 13:44
Message:
Logged In: YES
user_id=21627
It would be interesting what the value of "gc" is at the
time of the crash. It looks like you got an object that
claims to support GC but has a null tp_traverse.
----------------------------------------------------------------------
Comment By: Anthony Baxter (anthonybaxter)
Date: 2001-10-17 06:08
Message:
Logged In: YES
user_id=29957
I'm a doofus who read the gdb trace from the wrong
end - too much python lately :)
Nonetheless, the other end of the trace failed in
gc as well - and building without GC enabled worked.
Here's the trace with debugging enabled:
#0 0xff00 in ?? ()
#1 0x402f0 in collect (young=0x9b538, old=0x9b544) at
./Modules/gcmodule.c:379
#2 0x405a8 in collect_generations () at
./Modules/gcmodule.c:484
#3 0x40624 in _PyGC_Insert (op=0xbc1f24) at
./Modules/gcmodule.c:507
#4 0x5a224 in PyList_New (size=0) at Objects/listobject.c:61
#5 0x21bc8 in eval_code2 (co=0x1cb370, globals=0x21bc0,
locals=0x67,
args=0x0, argcount=1, kws=0xf89b24, kwcount=0, defs=0x0,
defcount=0,
closure=0xbc1f24) at Python/ceval.c:1741
Next trick is to rebuild without any optimisation (sigh)
as I suspect that it's inlined subtract_refs().
----------------------------------------------------------------------
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470