[Mailman-Developers] First big Mailing

Chuq Von Rospach chuqui@plaidworks.com
Thu, 10 Jan 2002 22:24:00 -0800

On 1/10/02 10:08 PM, "Marc Perkel" <marc@perkel.com> wrote:

> Anyhow - started out moving right along, maybe too well - saturated the T1
> pretty quick and the system slowed down.

That's not good. It's not a mailman problem, but it's not good. You need to
tune your MTA to slow it down so it doesn't overload the network. Once
it does, the different processes/threads start fighting each other for
network, and you'll see things go to heck. What I normally do is see what it
takes to saturate my network, and then tune the mailer to stop about 10% shy
of that. 
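
To put numbers on the "stop about 10% shy" rule, here is a rough sketch in
Python. It assumes the nominal T1 rate of 1.544 Mbit/s and an average message
size of 5,000 bytes (a made-up figure; use your list's real average). It also
ignores SMTP and TCP overhead, so treat the result as an upper bound. Better
still, plug in the throughput you actually measured at saturation rather than
the nominal line rate. The helper name is illustrative only.

```python
# Back-of-the-envelope ceiling for outbound mail on a T1 link.
# Assumptions (hypothetical, adjust to your list): the average message size
# on the wire, and that SMTP/TCP overhead is ignored (so this is an upper bound).

T1_BITS_PER_SEC = 1_544_000  # nominal T1 line rate; a measured figure is better
HEADROOM = 0.10              # stop about 10% shy of saturation


def max_messages_per_hour(link_bps: float, avg_msg_bytes: int,
                          headroom: float = HEADROOM) -> float:
    """Messages per hour that fit under (1 - headroom) of the link."""
    usable_bytes_per_sec = link_bps * (1 - headroom) / 8
    return usable_bytes_per_sec / avg_msg_bytes * 3600


if __name__ == "__main__":
    rate = max_messages_per_hour(T1_BITS_PER_SEC, avg_msg_bytes=5_000)
    print(f"about {rate:,.0f} messages/hour")
```

With those assumptions the ceiling works out to roughly 125,000 messages an
hour. The MTA's concurrency settings are what you'd adjust to stay under
whatever figure you get for your own link and list.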

> During this time I saw a number of errors. Messages indicating that I had too
> many files open (running tail on the exim logs). A few messages that looked
> like something couldn't open something.db or something like that.

That is really, really, really bad. That's a kernel problem: you've filled
up the kernel tables handling open files, so any future attempt to open a
file plain old fails. Once that starts happening, if you're LUCKY, processes
fail gracefully and go away, but in many cases, that doesn't happen. At this
point, you're pretty hosed.

You need to see how many open files your system is configured to allow, and
(a) slow down your MTA so it doesn't blow away your kernel tables, but also
(b) have your kernel tuned to increase the number available.
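
On Linux, one way to see both numbers is sketched below in Python (the /proc
path is Linux-specific; other systems expose this through sysctl or kernel
configuration instead). It reads the system-wide file table from
/proc/sys/fs/file-nr (allocated handles, unused handles, maximum) and the
per-process descriptor limit via resource.getrlimit(RLIMIT_NOFILE). Either
one running out can show up as "too many open files" in the logs, so it's
worth checking both. The helper names are illustrative, not part of any
existing tool.

```python
import resource
from pathlib import Path

# Linux-specific: three whitespace-separated fields, namely allocated file
# handles, allocated-but-unused handles, and the system-wide maximum.
FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text: str) -> tuple[int, int, int]:
    """Parse the contents of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def table_fraction_used(text: str) -> float:
    """Fraction of the kernel file table currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


if __name__ == "__main__":
    # Per-process limit on open descriptors (what `ulimit -n` reports).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"per-process limit: soft={soft} hard={hard}")

    if FILE_NR.exists():
        text = FILE_NR.read_text()
        allocated, unused, maximum = parse_file_nr(text)
        print(f"kernel file table: {allocated - unused} in use of {maximum} "
              f"({table_fraction_used(text):.1%})")
    else:
        print("no /proc/sys/fs/file-nr here; check your OS's own sysctl")
```

On Linux the system-wide ceiling is raised with the fs.file-max sysctl, and
the per-process limit with `ulimit -n` (or wherever your init scripts set it)
before the MTA starts.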

> The delivery slowed down as if it were done. System load dropped back to low
> normal levels.

That's actually fairly normal -- as the delivery takes up more resources, the
mailman side tends to slow down. What you're likely seeing is competition
for disk I/O, and your disk(s) are likely saturated. That's a whole 'nuther
tuning discussion, but it's simply performance, not "it broke" bad.

> Anyhow - even though Exim is delivering other email, messages sent to mailman
> are getting "stuck".

Not surprising, if the file table overflowed in the kernel. First, reboot.
Then work at getting exim tuned down; it's overloading the system under
load, and the system is, well, barfing.