[Python-bugs-list] [ python-Bugs-545410 ] corefile: python 2.1.3, zope 2.5.0

noreply@sourceforge.net noreply@sourceforge.net
Thu, 25 Jul 2002 09:11:17 -0700


Bugs item #545410, was opened at 2002-04-17 23:15
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=545410&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Geoff Gerrietts (ggerrietts)
Assigned to: Nobody/Anonymous (nobody)
Summary: corefile: python 2.1.3, zope 2.5.0

Initial Comment:
I regularly get a corefile out of Zope 2.5.0, running
on RedHat 6.2 and Python 2.1.3, usually within 5 or 6
page views.

Reproducing the problem requires (for me) starting up
Zope, going to the management interface, and bouncing
around on a few of the different objects. Sometimes the
first attempt to render the page will cause a crash,
but sometimes it takes a few clicks. After the crash,
Zope dumps core and politely restarts itself.

Traceback files are largely the same from one crash to
the next, with the only variation I've noted being the
addresses of variables -- this fits with the fact that
it takes a different number of steps each time.

Traceback files (infuriatingly enough) do not show line
number information for select.so, even though
selectmodule.o was compiled with -g specified.

Examination of the traceback shows that in stack frame
2, print ((PyCFunctionObject *)func) -> m_ml -> ml_name
reveals "select". In that same frame, print ((PyObject
*)arg) -> ob_type -> tp_name yields "Cannot access
memory at address 0x1f".

One traceback has been provided. Others, and other
info, available on request.

----------------------------------------------------------------------

>Comment By: Jeremy Hylton (jhylton)
Date: 2002-07-25 16:11

Message:
Logged In: YES 
user_id=31392

Are you still having this problem?

If so, I would recommend trying to reproduce the crash with
Zope 2.6 alpha 1 and a Python 2.1.3 or 2.2.1 debug build
(configure --with-pydebug).


----------------------------------------------------------------------

Comment By: Geoff Gerrietts (ggerrietts)
Date: 2002-04-25 00:47

Message:
Logged In: YES 
user_id=66989

MALLOC_CHECK_ didn't do anything different for me. I don't
know what that means exactly, because it definitely SEEMS
like there's some bad memory management going on. Maybe it's
just not bad ENOUGH.

The arg in stackframe 2 is actually NOT in the shared
library space -- the highest shared library address is
actually in the 0x404xxxx range. I'm not sure how the
programs are linked/run, so I'm not sure what 0x405d7ffc
would likely be if it's not shared libraries, though.
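
One way to answer that question on Linux is to look the address
up in the process's memory map (/proc/<pid>/maps). What follows is
a rough sketch, assuming a modern Python 3 rather than 2.1.3;
parse_maps and find_mapping are hypothetical helper names, and the
sample map lines are invented for illustration, not taken from the
core file.

```python
def parse_maps(maps_text):
    """Yield (start, end, perms, name) for each line of /proc/<pid>/maps."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        # Regions without a backing file have no sixth field.
        name = fields[5].strip() if len(fields) > 5 else "[anonymous]"
        yield int(start_s, 16), int(end_s, 16), fields[1], name


def find_mapping(maps_text, addr):
    """Return the (start, end, perms, name) region containing addr, or None."""
    for region in parse_maps(maps_text):
        start, end = region[0], region[1]
        if start <= addr < end:
            return region
    return None


# Invented sample in /proc/<pid>/maps format (NOT from the real process).
SAMPLE_MAPS = """\
08048000-080f5000 r-xp 00000000 03:01 12345 /usr/bin/python
40000000-40013000 r-xp 00000000 03:01 23456 /lib/ld-2.1.3.so
40400000-40480000 r-xp 00000000 03:01 34567 /usr/lib/libexample.so
40580000-405d8000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
"""

if __name__ == "__main__":
    region = find_mapping(SAMPLE_MAPS, 0x405d7ffc)
    if region is None:
        print("address is not mapped")
    else:
        print("0x405d7ffc lies in %08x-%08x %s %s" % region)
```

Against a live process, the text passed in would be the contents of
/proc/<pid>/maps read while Zope is still running. A region with no
file name is anonymous memory, which is where heap extensions and
thread stacks usually end up.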

I've been looking into (and continue to look into) the
possibility Guido suggested on python-dev, that this is an
overrun of a thread's stack. It seems unlikely, given that
thread stacks are allotted 2MB under Linux, but I suppose
anything's possible. The Data.fs in use in this core is
180MB; when compressed, it shrinks to about 90MB. It's a big
site, so overruns are a lot more possible here than in other
places.

I'm going to take a pass at pulling out all the 3rd-party
extension modules, but that's going to be very challenging.
A significant part of the architecture (roughly half,
including authentication) is accessible only through a 3rd
party extension, ILU. My independent testing with ILU
indicates that ILU works acceptably and doesn't cause memory
corruption. Because it's so ingrained into the site, I'm
going to look at doing more thorough testing and more of it,
and I'll try to isolate it so I can test around it, but I
don't think it's reasonable to try to replace it at this
point in time; the engineering effort would be several
weeks.

There is at least one other 3rd-party module that I'll be
able to rip out or fake up more easily.

I think this is more of a progress report than anything
else. Thanks for the eyes.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2002-04-19 14:50

Message:
Logged In: YES 
user_id=31392

Are you using any third party Python extension modules with
Zope?  It may be a memory corruption problem in an
extension.  If you are, you should try to reproduce the
problem with *only* Zope's core extensions and Python 2.1.3.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-18 05:44

Message:
Logged In: YES 
user_id=21627

In that backtrace, it is clear that arg in stackframe 2 is
bogus: 0x405d7ffc points into an area that appears to be
used for shared libraries (please do "info shared" to
support this theory).

Now, arg presumably is the return value of load_args, where
it was created through PyTuple_New. This suggests that the
memory management got corrupted; something that likely
happened much earlier.

I recommend setting MALLOC_CHECK_ before starting Python. The
documentation in malloc(3) says that setting it to 2 will
cause an abort() when an error is found; from inspecting the
implementation, it appears that setting it to 3 will combine
the printed debug traces with the call to abort(); please
experiment with all three values (1, 2, 3).
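
A minimal sketch of that experiment, assuming a modern Python 3 and
glibc's malloc; malloc_check_env and run_under_malloc_check are
hypothetical helper names, and the trivial child process below
stands in for the real Zope start command.

```python
import os
import subprocess
import sys


def malloc_check_env(level, base=None):
    """Return a copy of the environment with MALLOC_CHECK_ set to level."""
    if level not in (1, 2, 3):
        raise ValueError("MALLOC_CHECK_ level should be 1, 2 or 3")
    env = dict(os.environ if base is None else base)
    env["MALLOC_CHECK_"] = str(level)
    return env


def run_under_malloc_check(cmd, level):
    """Run cmd with MALLOC_CHECK_ set; return (exit status, stdout text)."""
    proc = subprocess.run(
        cmd,
        env=malloc_check_env(level),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # Stand-in child: it only reports the value it was started with.
    child = [sys.executable, "-c",
             "import os; print(os.environ['MALLOC_CHECK_'])"]
    for level in (1, 2, 3):
        status, out = run_under_malloc_check(child, level)
        print("MALLOC_CHECK_=%d -> exit status %d, child saw %s"
              % (level, status, out.strip()))
```

On POSIX systems a negative exit status from subprocess means the
child was killed by a signal, so an abort() triggered by the checker
would show up as -6 (SIGABRT).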

----------------------------------------------------------------------
