On Tue, 24 Oct 2000 09:38:56 +0200 (CEST) Dan Ohnesorg dan@feld.cvut.cz wrote:
On Mon, 23 Oct 2000, J C Lawrence wrote:
I'm certain the bug is not in Apache as it also occurs on posts passing straight to the list without going thru moderation. It is possible it is in Exim, tho I'd be extremely surprised. For one I've reinstalled all binaries from known good sources, and have MD5ed all Exim files against both known good sources and the copies installed on other happily running machines. It is also unlikely that the bug is in the kernel as I've reproduced the problem with kernels built on other (untouched) machines and then installed on the offending machine, and on kernels built locally from cryptographically verified source balls.
I have been running Mailman since the beginning of Mailman development (John Viegas version, I'm not sure if the name is correct, but cheers John).
As have I.
It runs on an SMP machine and I have never had such a problem.
Ditto. I currently have various versions of Mailman running on three SMP systems without problems. The fact that this particular (other) SMP system is having Mailman problems does not seem related to SMP.
I have seen problems with sendmail forking into 1000 processes while delivering messages to big (500 members) lists after committing pending requests.
This is a common MTA configuration issue, most often seen with QMail FWLIW. MTA tuning, especially as mail volumes grow, is a bit of an art. There was some interesting discussion in this area between me and Chug on this list a couple of months back that you might want to look at.
But my kernel was never confused by this like yours. I would say your memory chips are bad. A user-space program can never corrupt a filesystem.
It is possible I have bad RAM. It seems rather unlikely however (see below).
You are missing data from the beginning of the thread.
I had a disk die (it held various mail archives). In dying it not only took down the system, but also succeeded in trashing various bits of other filesystems on other devices. Among the files trashed were the tripwire database and its backups. This was not apparent when I replaced the drive and restored all relevant files from known-good/secure backups.
Given the new drive, the system remained unstable, crashing frequently (uptime measured in single digit hours). This is as compared to a previous uptime of near 200 days.
I then replaced every binary on the system from original verified packages, including the kernel (built a new kernel locally from cryptographically signed and checked sources, using a new hand-checked .config).
Crashes continued and seemed to be coincident with mail travelling thru Mailman, either thru the web approval process, or direct through to the exploder (no approval).
The MTA at this point appeared to be happy. Several tens of thousands of messages a day travel through that system, and were successfully passing through the system between crashes (my secondary MXes were dumping mail onto the system at a rate of well over 2K messages per minute upon rebooting from an extended crash -- which the system took quite happily). All crashes were observed to be time coincident with Mailman mail activities.
Suspecting bad disk blocks and potentially other hidden filesystem troubles I then replaced all filesystems (except / and /boot) on the system with journalling filesystems (ReiserFS), doing a surface check on all partitions before putting the new filesystems on. I again replaced every binary on the system from confirmed correct packages, and built a new kernel on a known secure machine from cryptographically signed and checked sources. Additionally I double checked by doing MD5Sum signature comparisons of key binaries on the target system, with specific attention paid to the mail system, against a known secure system. They matched perfectly.
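For anyone wanting to repeat that comparison, here is a minimal sketch in Python of checking files on a target tree against a known-good copy by MD5. It is not the procedure actually used above: the function names are made up for illustration, and it assumes the reference system's files are reachable as a local directory (for example a mounted copy).

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(target_root, reference_root, relative_paths):
    """Return the relative paths that are missing or differ between the trees.

    target_root and reference_root are hypothetical mount points; the
    caller supplies the list of files considered key binaries.
    """
    mismatches = []
    for rel in relative_paths:
        target = Path(target_root) / rel
        reference = Path(reference_root) / rel
        if not target.is_file() or not reference.is_file():
            mismatches.append(rel)
        elif md5_of(target) != md5_of(reference):
            mismatches.append(rel)
    return mismatches
```

An empty result means every listed file matched. MD5 is used only because it is what the checks above used; the same loop works with any other digest in hashlib.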
Finally I ran a semi-burn-in on the system: leaving it over night continuously building kernels AND using SCP to send those kernels to and from a remote box (to hit the network stack) with MD5 checks on each end AND sending an average of 25K mail messages per minute to another system on the local net (100base-T connected). The next morning there was not a single error in any file, all SCP copies had completed without error, all MD5 checks were passed, and neither MTA listed any problems (the messages themselves were bit-bucketed).
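The copy-and-verify part of such a burn-in can be sketched roughly as follows. This is an illustration only, not the script used for the run described above: it copies a file out and back with a local shutil.copyfile as a stand-in for SCP, and the transfer parameter is a hypothetical hook where a real network copy would be plugged in.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_check(source, passes, transfer=shutil.copyfile):
    """Copy source out and back `passes` times; return the failing pass numbers.

    transfer(src, dst) is an assumed hook: the default is a local copy,
    standing in for a copy to and from a remote machine.
    """
    expected = file_md5(source)
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        outbound = Path(scratch) / "outbound"
        returned = Path(scratch) / "returned"
        for number in range(1, passes + 1):
            transfer(source, outbound)
            transfer(outbound, returned)
            # Verify the digest at both ends of the round trip.
            if file_md5(outbound) != expected or file_md5(returned) != expected:
                failures.append(number)
    return failures
```

A clean run returns an empty list. Any pass number in the result points at corruption somewhere between reading the source and writing the copy.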
I then rolled the box back into production.
Crashes continued.
They also continued after building a new kernel on the target machine -- from similarly verified sources (needed a slight tweak).
I then replaced mailman from known good sources.
Crashes continued.
I've now removed all bytecoded files in the Mailman installation. Additionally I've hand unrolled and re-rolled the config.db for one of the lists that appears to be creating troubles. The unrolled DB looked good. Additionally, as I had in excess of 30K messages in my MTA spool pending delivery to assorted unresponsive remote systems and I suspected that a corrupted queue file might have been causing problems with Exim (which does briefly run as a privileged user), I hand moved all spool entries from the target system to another known-stable/secure system.
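Clearing the bytecode can be done with a short walk over the installation tree. The sketch below is an assumption about how one might do it, not the commands actually run here: remove_bytecode is a hypothetical helper, and the root directory is whatever prefix Mailman was installed under. Python regenerates the compiled files from the .py sources the next time the modules are imported.

```python
import os


def remove_bytecode(root, dry_run=True):
    """Find .pyc/.pyo files under root and return their paths, sorted.

    With dry_run=True nothing is deleted; with dry_run=False each file
    found is removed.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith((".pyc", ".pyo")):
                path = os.path.join(dirpath, name)
                found.append(path)
                if not dry_run:
                    os.remove(path)
    return sorted(found)
```

Running it first with the default dry_run=True lists what would go, which is a cheap sanity check before deleting anything on a production box.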
We'll see what happens now.
--
J C Lawrence                                 Home: claw@kanga.nu
---------(*)                               Other: coder@kanga.nu
http://www.kanga.nu/~claw/        Keys etc: finger claw@kanga.nu
--=| A man is as sane as he is dangerous to his environment |=--