Barry,
A week or so ago (right about the time I disappeared) I had a drive die on the system I run Mailman from. I thought replacing the drive and restoring its contents from backups would be enough. It wasn't. It turns out that in dying it corrupted several other filesystems in various odd and inelegant fashions (including both my tripwire DB and its backup, sod it).
This has left me in an odd position:
If I post to a specific list, or approve a held post for that list, there is an 80% chance that this will crash the machine (complete lock, no interrupts, no useful log entries).
This is reproducible. I've done it a great many times -- enough to wish I had a watchdog card in that machine. It's also rather scary -- Mailman is running as a non-privileged user after all.
As part of the recovery I've re-installed every single binary on the entire system (including Python et al). The one thing I haven't reinstalled is Mailman (v1.1). I also haven't disassembled or rebuilt the config.db's for the crashing lists.
Interested in the relevant files? I'll be saving everything off (of course), but I doubt I'll have time in the near future to dissect this.
-- J C Lawrence Home: claw@kanga.nu ---------(*) Other: coder@kanga.nu http://www.kanga.nu/~claw/ Keys etc: finger claw@kanga.nu --=| A man is as sane as he is dangerous to his environment |=--
"JCL" == J C Lawrence <claw@kanga.nu> writes:
JCL> A week or so ago (right about the time I disappeared) I had
JCL> a drive die on the system I run Mailman from. I thought
JCL> replacing the drive and restoring its contents from backups
JCL> would be enough. It wasn't. It turns out that in dying it
JCL> corrupted several other filesystems in various odd and
JCL> inelegant fashions (including both my tripwire DB and its
JCL> backup, sod it).
JCL> This has left me in an odd position:
JCL> If I post to a specific list, or approve a held post for
JCL> that list, there is an 80% chance that this will crash the
JCL> machine (complete lock, no interrupts, no useful log entries).
JCL> This is reproducible. I've done it a great many times --
JCL> enough to wish I had a watchdog card in that machine. It's
JCL> also rather scary -- Mailman is running as a non-privileged
JCL> user after all.
JCL> As part of the recovery I've re-installed every single binary
JCL> on the entire system (including Python et al). The one thing
JCL> I haven't reinstalled is Mailman (v1.1). I also haven't
JCL> disassembled or rebuilt the config.db's for the crashing lists.
JCL> Interested in the relevant files? I'll be saving everything
JCL> off (of course), but I doubt I'll have time in the near
JCL> future to dissect this.
I'm not sure what I can do, because I currently have no way of running Mailman 1.1. I could take your files and upgrade them to 2.0 and see what happens, but I'd be surprised if I get the same hard crash.
As you say, Mailman isn't doing anything special and has no special privs. How could that crash or hang your system? Maybe it's tripping a bug in your MTA, web server, or OS. What flavors and versions of those do you run?
Very odd.
-Barry
On Mon, 23 Oct 2000 23:03:47 -0400 (EDT) barry <barry@wooz.org> wrote:
> I'm not sure what I can do, because I currently have no way of running Mailman 1.1.
I would of course be willing to provide my entire install, plus the (Debian) package it was installed from.
> I could take your files and upgrade them to 2.0 and see what happens, but I'd be surprised if I get the same hard crash.
Aye, that's an artificial and not very revealing test.
> As you say, Mailman isn't doing anything special and has no special privs. How could that crash or hang your system? Maybe it's tripping a bug in your MTA, web server, or OS. What flavors and versions of those do you run?
Apache: 1.3.12
Exim: 3.10
Linux kernels: 2.2.10, 2.2.12, 2.2.16, 2.2.16+ReiserFS, 2.4.0-test9 or 2.4.0-test9+ReiserFS
I'm certain the bug is not in Apache, as it also occurs on posts passing straight to the list without going thru moderation. It is possible it is in Exim, tho I'd be extremely surprised. For one, I've reinstalled all binaries from known good sources, and have MD5ed all Exim files against both known good sources and the copies installed on other happily running machines. It is also unlikely that the bug is in the kernel, as I've reproduced the problem with kernels built on other (untouched) machines and then installed on the offending machine, and on kernels built locally from cryptographically verified source tarballs.
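[Editorial sketch] The MD5 comparison described above (installed files checked against a known good copy) could look roughly like the following. This is only an illustration under stated assumptions: the names `md5_of` and `compare_trees` are hypothetical, and it assumes the known good files are unpacked in a directory that mirrors the layout of the installed tree. It is not the procedure actually used in this thread.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(installed_root, reference_root):
    """Check every file under reference_root against its installed copy.

    Returns a list of (relative_path, status) pairs, where status is
    "missing" or "differs". An empty list means every reference file
    has an installed copy with the same digest.
    """
    installed_root = Path(installed_root)
    reference_root = Path(reference_root)
    problems = []
    for ref_file in sorted(reference_root.rglob("*")):
        if not ref_file.is_file():
            continue
        relative = ref_file.relative_to(reference_root)
        candidate = installed_root / relative
        if not candidate.is_file():
            problems.append((str(relative), "missing"))
        elif md5_of(candidate) != md5_of(ref_file):
            problems.append((str(relative), "differs"))
    return problems
```

A matching digest only shows that the bytes read back equal the reference. On a machine with suspect memory or disks, the read itself can be unreliable, so a clean result here is evidence rather than proof.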
> Very odd.
Precisely.
As a total aside: I've become quite fond of ReiserFS. I didn't have it running prior to these problems, and only installed it when I started crashing multiple times a day (while trying to figure out why). It's been a real life and time saver.
-- J C Lawrence Home: claw@kanga.nu ---------(*) Other: coder@kanga.nu http://www.kanga.nu/~claw/ Keys etc: finger claw@kanga.nu --=| A man is as sane as he is dangerous to his environment |=--
You might try upgrading to Python 2.0, built from source. Maybe Mailman tickles something in Python that tickles something in the kernel.
Aside from that, trying to figure out exactly which chunk of Python code is causing the crash is the next thing to do. I'm afraid that if you're not getting tracebacks, you'll have to liberally sprinkle the code with prints to track this down.
-Barry
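[Editorial sketch] Since the machine locks up hard and leaves no useful log entries, ordinary prints may never reach the disk before the hang. A minimal sketch of the "sprinkle the code with prints" idea that forces each marker out to disk is shown below. The class `CrashTrace`, the helper `last_mark`, and the marker texts are hypothetical names for illustration; nothing here is part of Mailman.

```python
import os
import time


class CrashTrace:
    """Append-only trace log that is pushed to disk after every line."""

    def __init__(self, path):
        # Append mode keeps the markers written before an earlier crash.
        self._handle = open(path, "a", encoding="utf-8")

    def mark(self, message):
        """Write one timestamped line, then flush and fsync it."""
        self._handle.write(
            "%.3f pid=%d %s\n" % (time.time(), os.getpid(), message)
        )
        self._handle.flush()             # Python buffer -> kernel
        os.fsync(self._handle.fileno())  # ask the kernel to write to disk

    def close(self):
        self._handle.close()


def last_mark(path):
    """Return the message of the last complete line in the trace, or None."""
    with open(path, encoding="utf-8") as handle:
        # A line without a trailing newline was cut off mid-write; skip it.
        lines = [line for line in handle if line.endswith("\n")]
    if not lines:
        return None
    # Each line has the form "<timestamp> pid=<n> <message>".
    return lines[-1].rstrip("\n").split(" ", 2)[2]
```

In use, one would create a single `CrashTrace` (for example on a path such as `/var/tmp/mailman-trace.log`, an assumed location), call `mark("...")` before each suspect step, and after the reboot call `last_mark` to see how far the code got. The fsync makes each marker much more likely to survive a lockup, but it cannot guarantee it if the disk or controller is itself misbehaving.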
On Mon, 23 Oct 2000 23:31:51 -0400 (EDT) barry <barry@wooz.org> wrote:
> You might try upgrading to Python 2.0, built from source. Maybe Mailman tickles something in Python that tickles something in the kernel.
<sigh> I'm not keen on that as it perturbs the base condition.
> Aside from that, trying to figure out exactly which chunk of Python code is causing the crash is the next thing to do. I'm afraid that if you're not getting tracebacks, you'll have to liberally sprinkle the code with prints to track this down.
Quite.
-- J C Lawrence Home: claw@kanga.nu ---------(*) Other: coder@kanga.nu http://www.kanga.nu/~claw/ Keys etc: finger claw@kanga.nu --=| A man is as sane as he is dangerous to his environment |=--
"JCL" == J C Lawrence <claw@kanga.nu> writes:
JCL> Apache: 1.3.12 Exim: 3.10 Linux kernels: 2.2.10, 2.2.12,
JCL> 2.2.16, 2.2.16+ReiserFS, 2.4.0-test9 or 2.4.0-test9+ReiserFS
BTW, my development platform is basically a stock RH6.1 kernel 2.2.12, Apache 1.3.12, Postfix 19991231, Python 2.0.
-Barry
On Mon, 23 Oct 2000, J C Lawrence wrote:
> I'm certain the bug is not in Apache, as it also occurs on posts passing straight to the list without going thru moderation. It is possible it is in Exim, tho I'd be extremely surprised. For one, I've reinstalled all binaries from known good sources, and have MD5ed all Exim files against both known good sources and the copies installed on other happily running machines. It is also unlikely that the bug is in the kernel, as I've reproduced the problem with kernels built on other (untouched) machines and then installed on the offending machine, and on kernels built locally from cryptographically verified source tarballs.
I have been running Mailman since the beginning of its development (the John Viegas version; I'm not sure if the name is correct, but cheers, John). It runs on an SMP machine and I have never had such a problem. I have seen sendmail fork into 1000 processes while delivering messages to big (500 members) lists after committing a pending request. That is normal, and I solved it by changing the parameter that was called something like the number of delivery processes. But my kernel was never confused by this the way yours is. I would say your memory chips are bad. A user-space program can never corrupt a filesystem.
cheers dan
--
________________________________________
DDDDDD
DD DD Dan Ohnesorg, supervisor on POWER
DD OOOO Dan@feld.cvut.cz
DD OODDOO Dep. of Power Engineering
DDDDDD OO CTU FEL Prague, Bohemia
OO OO work: +420 2 24352785;+420 2 24972109
OOOO home: +420 311 679679;+420 311 679311
________________________________________
I saw her just as God created her. And I became an atheist.
At 11:03 PM -0400 10/23/00, barry@wooz.org wrote:
JCL> If I post to a specific list, or approve a held post for
JCL> that list, there is an 80% chance that this will crash the
JCL> machine (complete lock, no interrupts, no useful log entries).
JCL> This is reproducible.
> I'm not sure what I can do, because I currently have no way of running Mailman 1.1. I could take your files and upgrade them to 2.0 and see what happens, but I'd be surprised if I get the same hard crash.
I sincerely doubt Barry would see it, because I'd be willing to bet dinner it's a bad block on the disk, and it's lodged in one of the files (probably the .db file) attached to that list.
I'd do a surface test of that disk and see if it finds problems. I'd give it a 90% chance it will. This just screams "bad block!" at me.
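[Editorial sketch] A full surface test needs a dedicated disk tool, but the narrower suspicion (a bad block sitting inside one of the list's files) can be probed by simply reading those files end to end and recording where reads fail. The sketch below is only that file-level read test, not a surface test. The function names `find_unreadable_regions` and `scan_tree` are hypothetical, and the code assumes a bad block shows up as an I/O error on read, which is not how the hard lockups in this thread behaved.

```python
import os


def find_unreadable_regions(path, chunk_size=64 * 1024):
    """Read a file chunk by chunk and return [(offset, error_text), ...].

    An empty list means every chunk could be read without an OS error.
    """
    failures = []
    size = os.path.getsize(path)
    # buffering=0 gives a raw file object, so each read maps to one OS read.
    with open(path, "rb", buffering=0) as handle:
        offset = 0
        while offset < size:
            handle.seek(offset)
            try:
                data = handle.read(chunk_size)
            except OSError as exc:
                failures.append((offset, str(exc)))
                offset += chunk_size  # step past the failing chunk
                continue
            if not data:
                break  # file shrank while we were reading it
            offset += len(data)
    return failures


def scan_tree(root):
    """Run the read test on every regular file under root.

    Returns {path: failures} containing only files that had problems.
    """
    report = {}
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(directory, name)
            if not os.path.isfile(full):
                continue
            try:
                failures = find_unreadable_regions(full)
            except OSError as exc:
                failures = [(0, str(exc))]  # could not even open or stat it
            if failures:
                report[full] = failures
    return report
```

Pointing `scan_tree` at the directory holding the affected list's files would be the obvious first run. Reads served from the kernel's page cache never touch the disk, so this is most informative right after a reboot, and a clean result does not rule out a bad block elsewhere on the drive.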
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
Be just, and fear not.
participants (4)
- barry@wooz.org
- Chuq Von Rospach
- Dan Ohnesorg
- J C Lawrence