Re: [Mailman-Developers] First big Mailing
On Thu, 10 Jan 2002 22:08:19 -0800 Marc Perkel <marc@perkel.com> wrote:
I'm doing my first big mailing with Mailman/Exim to deliver EFFector for the Electronic Frontier Foundation. The list is about 20,000 names.
I'm going to do some guessing here, so make allowances.
Anyhow - started out moving right along, maybe too well - saturated the T1 pretty quick and the system slowed down. But the email logs were really going fast. This went on for about 45 minutes.
Translation: A message came in, was approved (automatically or by moderation), and moved to ~mailman/qfiles along with a copy of the distribution list. qrunner ran a little while later (cron job) and started delivering the broadcast to the MTA (thus all the activity).
During this time I saw a number of errors. Messages indicating that I had too many files open (running tail on the exim logs). A few messages that looked like something couldn't open something.db or something like that.
Not good. Very not good. Loose guess: You have configured Exim for so many queue runners, parallel deliveries, and/or simultaneous incoming deliveries that Exim exceeded either your kernel's maximum number of file handles per process, or the total number of file handles in the system.
You need to track this down and fix it. Now. Before you do anything else.
It may be enough (for the short term) to just reconfigure Exim to use a smaller number of processes etc, but that's a stopgap, not a fix.
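As a starting point for tracking this down, here is a minimal Python sketch that reports the per-process open-file limit. It assumes a Unix-like system where the standard `resource` module is available; the helper name `describe_fd_limit` is made up for illustration. It reports the limit inherited by the process that runs it, which is not necessarily the limit Exim runs under, and it says nothing about the system-wide total.

```python
import resource

def describe_fd_limit():
    """Return the soft and hard per-process open-file limits as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def show(value):
        # RLIM_INFINITY means no limit is enforced at this level.
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return "soft=%s hard=%s" % (show(soft), show(hard))

if __name__ == "__main__":
    print(describe_fd_limit())
```

Running it as the user that starts the MTA gives a rough idea of how much headroom each process has before "too many open files" errors appear.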
The delivery slowed down as if it were done. System load dropped back to low normal levels.
This sounds like qrunner processes being forked by cron, each trying to deliver more of your 20K messages to Exim. Qrunner has an internal timeout (15mins IIRC) after which it will be reaped and a new process forked by the next cron pass.
This lasted a while - then things started back up again really delivering messages. These deliveries come in spurts.
Which would explain the above bit.
Anyhow - even though Exim is delivering other email. Messages sent to mailman are getting "stuck".
Odds are good that a qrunner process was ungracefully reaped, resulting in a stale lock file in ~mailman/locks. As a result, subsequent qrunner processes are doing nothing, waiting for the lock to time out.
Fix:
Check that there are no qrunner processes running. If none are running, delete ~mailman/locks/*
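The check-then-delete step above can be sketched in Python. This is an illustration under stated assumptions, not a Mailman tool: the function names are hypothetical, the process check is a plain substring match on `ps` output, and it defaults to a dry run so nothing is deleted until you ask for it.

```python
import os

def qrunner_running(ps_output):
    """True if any line of the given `ps` output mentions qrunner."""
    return any("qrunner" in line for line in ps_output.splitlines())

def clear_locks(lock_dir, ps_output, dry_run=True):
    """Remove the files in lock_dir, but only if no qrunner is running.

    Returns the paths removed (or that would be removed when dry_run
    is true). Returns an empty list if a qrunner process was seen.
    """
    if qrunner_running(ps_output):
        return []
    paths = [os.path.join(lock_dir, name)
             for name in sorted(os.listdir(lock_dir))]
    paths = [p for p in paths if os.path.isfile(p)]
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths

# Possible usage (the lock directory is the one named in this thread):
#   import subprocess
#   ps = subprocess.check_output(["ps", "ax"]).decode("utf-8", "replace")
#   print(clear_locks(os.path.expanduser("~mailman/locks"), ps))
```

Taking the `ps` output as an argument keeps the decision logic separate from the system call, so it can be tried against a scratch directory first.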
Notes:
You *REALLY* need to fix your file handle problem. It's not unlikely that that is a fundamental cause of your problems.
qrunner is responsible for all motion of mail thru the Mailman system, in receipt, moderation, and broadcasting. If qrunner is locked, nothing will happen until it is unlocked.
It's as if nothing in mailman is working. I see messages being sent to mailman. But mailman isn't responding.
Nope. They're being stashed in ~mailman/data, awaiting qrunner to Do The Right Thing (deliver to list, process as bounce, broadcast, etc.).
I don't know if something is holding these messages and this is waiting in some queue - or if Mailman has crashed and is eating messages - or something is corrupt or locked or overloaded or what.
See above.
Again: FIX YOUR FILE HANDLE PROBLEM FIRST!
Until you do, you can lose mail and have inexplicable, impossible-to-debug problems.
Notes:
If you can spare the systems, the first thing you'll need to do is separate the MTA that is handling final delivery from your Mailman machine. Bounce processing can and will *really* screw with your efficiencies and load with lists of that size.
Recommended architecture for scaling Mailman in your sort of situation is to have your Mailman system deliver all outbound mail to a smarthost (either via a smarthost rule on your MTA, or directly via SMTP config in Mailman). Ensure that the MXing for bounces will *NOT* go back to your smarthost, but will go directly to your Mailman system.
This allows your smarthost to be tuned for what you need it to do: handle outbound deliveries efficiently, and allows your Mailman system to remain responsive (and under less load as there's no local overburdened MTA queue) for processing inbound bounces etc.
Set SMTP_MAX_RCPTS in ~mailman/Mailman/mm_cfg.py to something reasonably large. Suggest something in the 50 - 100 range. Do not go any larger. This may help with temporarily resolving your file handle problem. It will also decrease system load in general and help smooth things along a bit. Later, when everything is known working, you can start tuning for performance and look at dropping SMTP_MAX_RCPTS down to around 5 (usually the sweet spot).
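In mm_cfg.py the setting is a single assignment such as `SMTP_MAX_RCPTS = 50`. Since it caps the number of recipients per SMTP transaction, the rough arithmetic for a 20,000-address list can be sketched as below. This is illustrative only and is not Mailman's delivery code; the function name and the example.com addresses are made up.

```python
def chunk_recipients(recipients, max_rcpts):
    """Split recipients into consecutive groups of at most max_rcpts."""
    if max_rcpts < 1:
        raise ValueError("max_rcpts must be at least 1")
    return [recipients[i:i + max_rcpts]
            for i in range(0, len(recipients), max_rcpts)]

# Hypothetical 20,000-address list, the size mentioned in this thread.
recipients = ["user%d@example.com" % i for i in range(20000)]

for max_rcpts in (100, 50, 5):
    chunks = chunk_recipients(recipients, max_rcpts)
    print("SMTP_MAX_RCPTS=%d -> %d transactions" % (max_rcpts, len(chunks)))
```

With these numbers, a cap of 100 gives 200 handoffs to the MTA, 50 gives 400, and 5 gives 4,000, which is why the small value is something to tune toward only once the system is known to be stable.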
Anyhow - I'd like some general feedback on what might be happening. The newsletter contains an important story about Norway indicting Jon Johansen criminally. He's the guy who wrote the DVD code.
Yeah, I read it.
ObOffer:
If you would like some help offloading your mail traffic, I'm willing to smarthost a chunk of it for you (will need to verify with my upstreams). Basic idea would be to smarthost route a couple TLDs to me for final delivery (I've got a couple T3's so I should be able to take a fair percentage of your load).
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.