[Python-bugs-list] [ python-Bugs-493183 ] FreeBSD/Python 2.1.1 Segmentation fault

noreply@sourceforge.net noreply@sourceforge.net
Sat, 26 Jan 2002 23:05:52 -0800


Bugs item #493183, was opened at 2001-12-13 21:02
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=493183&group_id=5470

Category: Python Interpreter Core
Group: 3rd Party
Status: Closed
Resolution: Works For Me
Priority: 5
Submitted By: Jason R. Mastaler (jasonrm)
Assigned to: Nobody/Anonymous (nobody)
Summary: FreeBSD/Python 2.1.1 Segmentation fault

Initial Comment:
This is Python 2.1.1 on FreeBSD 4.4-RELEASE/i386.

I have a reproducible way to make Python 2.1.1 segfault
here, but I'm
not sure the best way to go about debugging it.  I'll
save you the
gory details unless you request them, but basically I
use a Python
program called "tmda-inject" to send my e-mail.  It's
core dumping
when fed this one particular message (mess1.txt), but
not others.
Further, Python 1.5.2, 2.0 and 2.2b2 on the same
machine have no such
trouble.  It only appears with Python 2.1.1.  I can
also reproduce
this problem a complately different machine also
running FreeBSD 4.4
and Python 2.1.1, but not with other operating
systems.  The problem
appears to be specifically with FreeBSD and Python
2.1.1.

I've attached the "mess1.txt" e-mail message, and also
python2.1.core.
If you'd like to look at tmda-inject, it's part of my
TMDA package
available at http://sf.net/projects/tmda/.  Let me know
if I can do
anything further.

[jasonrm@nightshade jasonrm]$ cat mess1.txt | python2.1
cvs/tmda/dist/bin/tmda-inject 
Segmentation fault (core dumped)

[jasonrm@nightshade jasonrm]$ cat mess1.txt | python1.5
cvs/tmda/dist/bin/tmda-inject 
[jasonrm@nightshade jasonrm]$ cat mess1.txt | python2.0
cvs/tmda/dist/bin/tmda-inject 
[jasonrm@nightshade jasonrm]$ cat mess1.txt | python2.2
cvs/tmda/dist/bin/tmda-inject 

[jasonrm@nightshade jasonrm]$ gdb
/usr/local/bin/python2.1 python2.1.core 
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public
License, and you are
welcome to change it and/or distribute copies of it
under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show
warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...(no
debugging symbols found)...
Core was generated by `python2.1'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libutil.so.3...(no
debugging symbols found)...done.
Reading symbols from /usr/lib/libm.so.2...(no debugging
symbols found)...done.
Reading symbols from /usr/lib/libc_r.so.4...(no
debugging symbols found)...done.
Reading symbols from
/usr/local/lib/python2.1/lib-dynload/strop.so...(no
debugging symbols found)...done.
Reading symbols from
/usr/local/lib/python2.1/lib-dynload/time.so...(no
debugging symbols found)...done.
Reading symbols from
/usr/local/lib/python2.1/lib-dynload/binascii.so...(no
debugging symbols found)...done.
Reading symbols from
/usr/local/lib/python2.1/lib-dynload/sha.so...(no
debugging symbols found)...done.
Reading symbols from
/usr/local/lib/python2.1/site-packages/cdbmodule.so...(no
debugging symbols found)...done.
Reading symbols from /usr/libexec/ld-elf.so.1...(no
debugging symbols found)...done.
#0  0x28177e0e in _thread_sys_fsync () from
/usr/lib/libc_r.so.4
(gdb) where
#0  0x28177e0e in _thread_sys_fsync () from
/usr/lib/libc_r.so.4
#1  0x2818ea24 in _thread_autoinit_dummy_decl () from
/usr/lib/libc_r.so.4
#2  0x28179222 in vfprintf () from /usr/lib/libc_r.so.4
#3  0x28169f75 in sprintf () from /usr/lib/libc_r.so.4
#4  0x80934ae in _PyString_FormatLong ()
#5  0x8093d48 in PyString_Format ()
#6  0x807b868 in PyNumber_Remainder ()
#7  0x80556a2 in PyEval_EvalCode ()
#8  0x8058d95 in PyEval_CallObjectWithKeywords ()
#9  0x80574cd in PyEval_EvalCode ()
#10 0x8058d95 in PyEval_CallObjectWithKeywords ()
#11 0x80574cd in PyEval_EvalCode ()
#12 0x8058d95 in PyEval_CallObjectWithKeywords ()
#13 0x80574cd in PyEval_EvalCode ()
#14 0x8058d95 in PyEval_CallObjectWithKeywords ()
#15 0x80574cd in PyEval_EvalCode ()
#16 0x8058d95 in PyEval_CallObjectWithKeywords ()
#17 0x80574cd in PyEval_EvalCode ()
#18 0x8058d95 in PyEval_CallObjectWithKeywords ()
#19 0x80574cd in PyEval_EvalCode ()
#20 0x80548a8 in PyEval_EvalCode ()
#21 0x806d75d in PyRun_FileExFlags ()
#22 0x806d70e in PyRun_FileExFlags ()
#23 0x806d6de in PyRun_FileExFlags ()
#24 0x806cab0 in PyRun_SimpleFileExFlags ()
#25 0x806c5b0 in PyRun_AnyFileExFlags ()
#26 0x8051795 in Py_Main ()
#27 0x80511c2 in main ()
#28 0x8051111 in _start ()
(gdb) 


----------------------------------------------------------------------

>Comment By: Jason R. Mastaler (jasonrm)
Date: 2002-01-26 23:05

Message:
Logged In: YES 
user_id=85984

Just for the record, the python-cdb author and I
were able to track down the memory corruption bug.
This was fixed in version 0.32 of the module.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-01-02 13:43

Message:
Logged In: NO 

Right. Bugs in 3rd party modules are very common causes
of core dumps, and often there's very little that points
to the culprit, because all they do is corrupt memory,
placing a landmine for some other innocent code to step
on.

--Guido (can't log in here on Windows :-)

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2002-01-02 12:06

Message:
Logged In: YES 
user_id=85984

Here is something very strange.  I can now reproduce
the segfault nearly each time, not just with one particular
message.  I installed an extension module this morning
and after that point things went bad in a hurry.  The 
module is python-cdb
(http://pilcrow.madison.wi.us/sw/python-cdb-0.22a.tar.gz)

This module was also present in my Python2.1 installation.
It appears that removing the cdbmodule.so from my
site-packages directory, or not using it in my program
clears up the segfault, but as soon as I put it back, 
Python goes right back to segfaulting.

I guess I'll contact the python-cdb maintainer about this.


----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2002-01-02 09:49

Message:
Logged In: YES 
user_id=85984

This problem has crept up again for the second
time now.  This time with Python 2.2, and
strangely I can't reproduce it with any other
Python version.  As was the case last time,
the hexlify() function is what is causing the
segfault.

I've attached backtrace_python2.2.txt -- it
looks a bit more verbose than previous
backtraces.  Does this tell us anything new,
or is it still a mystery?


----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-18 09:06

Message:
Logged In: YES 
user_id=85984

Well, this bug does have to do with Python if
only in the configure scripts and such.  There
is something in Python 2.1.1 that is triggering
this that isn't in Python 1.5.2, 2.0 or 2.2.

I do agree though that the core of the problem
is in FreeBSD, Python 2.1.1 is just bringing it
to the surface.

There is also some evidence that this problem
is fixed in later FreeBSD releases.  For example,
I can't reproduce the problem on a 4.4-STABLE
box (a couple months newer than my 4.4-RELEASE
boxes).  It would be interesting to see if this problem
"dissapears" when I upgrade to 4.5-RELEASE 
next month.


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-18 08:47

Message:
Logged In: YES 
user_id=31435

No, don't do it for Python's sake -- there's no reason to 
believe this has anything to do with Python, but is an 
obscure bug in the FreeBSD libc (which *has* been triggered 
by non-Python apps too, such as mysql).  So I predict 
you'll see it again; if you pursued it, it would be for the 
benefit of FreeBSD users.

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-18 08:36

Message:
Logged In: YES 
user_id=85984

Tim, if you want me to pursue the issue with a
FreeBSD libc internist for the good of Python, 
I'll do so.  But otherwise, I'll let this one go as I
don't plan on using Python 2.1.1 any longer.
There is no evidence that this problem exists 
with Python 2.2.


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-17 14:46

Message:
Logged In: YES 
user_id=31435

Note that the same kind of mysterious bug was reported 
against mysql on FreeBSD (I mentioned this several notes 
ago).  There's really no evidence of a Python bug here.  
Find someone knowledgeable about libc internals on 
FreeBSD:  that's where the evidence points.  They may also 
be able to help you figure out why you're getting a mix of 
threaded and non-threaded libcs; I can't say whether that's 
a problem, but it sure doesn't look right.

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-17 14:14

Message:
Logged In: YES 
user_id=85984

I'd say just close it out and keep it in the back of
your mind in case something similar pops up in
the future.  I'll do the same.

The bug is obscure and easy to workaround anyway.
Thanks for your help with this.  At the very least it
has been an educational experience for me.  


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-17 14:10

Message:
Logged In: YES 
user_id=6380

So do you want me to keep this open? I don't expect we'll
find the answer.

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-17 14:07

Message:
Logged In: YES 
user_id=85984

Your prediction is correct:

[root@nightshade Python-2.1.1]# ./python 
df
Python 2.1.1 (#1, Dec 17 2001, 15:01:24) 
[GCC 2.95.3 20010315 (release) [FreeBSD]] on freebsd4
Type "copyright", "credits" or "license" for more
information.
>>> 


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-16 21:13

Message:
Logged In: YES 
user_id=6380

Add these three lines to Py_Main() in Modules/main.c, after
the line saying "PyCompilerFlags cf;":

	char buffer[120];
	sprintf(buffer, "%.1lx", 223L);
	printf("%s\n", buffer);

and rebuild Python. When you run Python, it should print
"df" before the standard banner. If it crashes, we've proved
that Python is not the cause of the problem. But I predict
that it'll work. :-(

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-16 21:01

Message:
Logged In: YES 
user_id=85984

As far as the standalone program, I'm not a C programmer,
(I've become Python-dependent :-), but if you can give me
an idea of what it should contain, I'll give it a try.
 

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-16 20:23

Message:
Logged In: YES 
user_id=6380

Sigh. I think it's a dead end. What about a stand-alone
program?

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-16 19:09

Message:
Logged In: YES 
user_id=85984

Tim asked whather I can verify whether I'm mixing the
threaded and non-threaded versions of libc.  Is the
following enough to tell?

$ ldd /usr/local/src/Python-2.1.1/python
/usr/local/src/Python-2.1.1/python:
        libc_r.so.4 => /usr/lib/libc_r.so.4 (0x280ca000)
        libutil.so.3 => /usr/lib/libutil.so.3 (0x28182000)
        libm.so.2 => /usr/lib/libm.so.2 (0x2818b000)
        libc.so.4 => /usr/lib/libc.so.4 (0x281a7000)

libc is the non-threaded library, and libc_r is the
threaded version, so it looks like they are being
mixed, yes?

However, my Python 2.0.1 also looks the same way, and
I'm not having any problems with it.

$ ldd /usr/local/src/Python-2.0.1/python
/usr/local/src/Python-2.0.1/python:
        libc_r.so.4 => /usr/lib/libc_r.so.4 (0x2816c000)
        libutil.so.3 => /usr/lib/libutil.so.3 (0x28224000)
        libm.so.2 => /usr/lib/libm.so.2 (0x2822d000)
        libc.so.4 => /usr/lib/libc.so.4 (0x28249000)

Next, Guido asked me to check some local variables 
in that stack frame.  See attached backtrace3.txt.

And to answer your question, yes I built all the Python
versions I'm testing myself.  I even rebuilt 2.0.1 and 
2.1.1 in the simplest manner to make sure I'm comparing 
them equally.  (Just ./configure && make).


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-16 17:47

Message:
Logged In: YES 
user_id=31435

BTW, a google search on _thread_autoinit_dummy_decl turns 
up another FreeBSD bug report of a death deep in libc_r 
starting with a sprintf() call.  It had to do with mysql, 
not with Python.  The tracebacks in both cases look 
worthless to me, because (for example) 
_thread_autoinit_dummy_decl is an extern int in the FreeBSD 
source, not a function at all.

This increases the suspicion that libc versions are getting 
mixed up (and that gdb is looking up addresses in "the 
wrong" libc).

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-16 17:20

Message:
Logged In: YES 
user_id=6380

OK, it's the second sprintf call in formatint. Can you check
some local variables in that stack frame? I expect fmt to be
"%.1lx", but would like to see that confirmed; x should be
one of the values from your binary string: 206, 199, 166.
Can you see which one?

It would then be interesting to find out what a small C
program with that same sprintf() does. Make sure that the C
program is compiled and linked in exactly the same way as
python. (What I do in such cases is I put the experimental
statement in Python's main() and rebuild Python. :-)

The build procedure of each Python version is somewhat
different, and it's likely that each version picks slightly
different compile/link options. Also, did you build all the
Python versions you tried yourself?


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-12-16 17:17

Message:
Logged In: YES 
user_id=31435

The new backtrace is helpful (thanks!), but it just makes 
it harder to believe that Python is at fault.  The segfault 
occurs four levels deep in system library functions, and 
from inspection there's nothing even remotely suspicious 
about the sprintf call that kicks it off.

As Guido suggested, can you verify you're not mixing the 
threaded and non-threaded versions of libc?

Short of that, you really need a FreeBSD libc author to 
help with this.

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-16 15:26

Message:
Logged In: YES 
user_id=85984

Alright, I rebuilt Python 2.1.1 with -g, so the gdb
backtrace now contains some more information.  I
attached backtrace2.txt which is a backtrace of the
program failing when calling the hexlify() function.
I tried compiling with `--with-pydebug', but then I
can't reproduce the core dump for some reason.

As you mention, this could have something to do with
FreeBSD's libc.  However, does that explain why I only
have this problem with 2.1.1?  On the same system,
Python 1.5.2, 2.0, and 2.2c1 compiled in the same
manner (simply ./configure && gmake) don't produce a
core dump.


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-16 11:24

Message:
Logged In: YES 
user_id=6380

See the repeated "(no debugging symbols found)..." in the
gdb output? try recompiling with -g. I strongly suspect that
the traceback is lying, and we're in fact not in
PyString_AsStringAndSize() when we call sprintf -- there's
no sprintf call in that function (I checked the 2.1.1
source) but there is one two static functions down, in
string_repr(), which is indeed called when you pickle a
string object.  This would match the original gdb stacktrace
you reported: there's no sprintf call in
_PyString_FormatLong(), but there are two in the next static
function, formatint(), and it does make sense that that
function would be called by the %02x format.

All this makes no sense, however, because I can see nothing
wrong with the arguments to sprintf here!

I've heard of problems on FreeBSD having to do with
different (threaded and unthreaded?) versions of libc. 
Could that be the issue here? 

Can you rebuild Python 2.1.1 with debugg flags on?
(./configure --with-pydebug should do it for 2.2; I don't
recall if the 2.2.1 configure supports that though -- try
--help first.)

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-15 23:49

Message:
Logged In: YES 
user_id=85984

The argument to hexlify() is 3 bytes of raw binary.
It's simply an HMAC of '%d' % time.time() stored in binary.
Any suggestions on how to visualize, analyze, or store
this argument?  There is something about it that
Python really doesn't like.  I even tried to pickle
it, and that too made Python core dump (see attached
backtrace_pickle.txt).  Strangely, binascii.hexlify
has no problem handling it.


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-15 17:18

Message:
Logged In: YES 
user_id=6380

Very strange. You could try setting a breakpoint in
PyString_Format.

But I would first try to boil down the program to a smaller
example. For example, figure out what the argument to
hexlify() is, and see if you get the same crash when you
call hexlify() on its own with that argument.

I find it curious why the core dump is in
_PyString_FormatLong(), since hexlify doesn't pass longs to
the % operator -- it passes a tuple of small ints.

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-15 16:50

Message:
Logged In: YES 
user_id=85984

BTW, if you need access to a FreeBSD machine,
Sourceforge's "Compile Farm" has one available.

ssh to `cf.sourceforge.net'


----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-15 16:47

Message:
Logged In: YES 
user_id=85984

Alright, through print statements, I've been able to
pinpoint exactly where the program is failing causing
Python to core dump.  It is when the following
function is called:

def hexlify(b):
    return "%02x"*len(b) % tuple(map(ord, b))

Ironically, I got this function from you Guido.  :-)
(http://groups.google.com/groups?hl=en&selm=3924E204.CE01CA0A%40python.org)

I only have to replace the function body with:

return "foo"

and the program no longer core dumps.

The difficult thing has been trying to reproduce this
problem with a smaller piece of code.  I've been
trying for hours and can't do it.  As I said in my
original message, I've never had this problem before
this particular message.  Even though its contents
look harmless, there is something there that is
triggering this problem.  It seems pretty complex to
me.

However, I can run Python from inside gdb and get the
program to crash from there.  I changed the program to
open "mess1.txt" directly.  Not being experienced with
gdb though, I'm not sure what kind of "digging around"
to do.  Could you give me some direction here?  I've
attached "backtrace1.txt" as a starting point.


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-12-13 21:25

Message:
Logged In: YES 
user_id=6380

Thanks for all the info, but I'm afraid you're going to have
to work harder. We don't have FreeBSD here, so your core
file won't do us any good. Can you use gdb to get a
traceback and do some more digging? (It would be best not to
do this from the core file, but to run the program under gdb
until it crashes -- that way the stacktrace is more reliable
and you can do more digging around.)

Also, it would be really helpful if you could boil this down
to a smaller piece of Python code. Maybe you can put print
statements in your Python program that give you an idea of
which operation is failing.

As it is, the problem is too hard to reproduce to be worth
debugging.

The sample message contains nothing objectionable, so I
think this must be a matter of something the program does
that reads or write a bad pointer, where the value of the
bad pointer depends on what was previously on the stack or
what is in some spot of the heap.

----------------------------------------------------------------------

Comment By: Jason R. Mastaler (jasonrm)
Date: 2001-12-13 21:13

Message:
Logged In: YES 
user_id=85984

Oops, the core file was too large to upload (1.5MB). 
If you want to see it, you can download it at:

http://www.mastaler.com/tmp/2001-12-13/python2.1.core


----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=493183&group_id=5470