[Mailman-Users] Call for suggestions

Chuq Von Rospach chuqui at plaidworks.com
Wed May 9 06:57:59 CEST 2001


On 5/8/01 5:18 PM, "J C Lawrence" <claw at kanga.nu> wrote:

> If I were to go for a first order attempt at reliability/scalability
> I'd be tempted to do something like

> Outbound list mail is not delivered to the local MTA but to a set
> of remote MTAs hidden behind a DNS round robin

The problem is that with Mailman 2.0, the limiting factor is the
single-threadedness of qrunner -- it can only process one message at a time,
and throwing multiple MTAs at delivery won't help, because it can only feed
one at a time. My MTA on my big Mailman machine is basically idle all the
time, because I can't feed it fast enough to make it break a sweat.
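
A back-of-the-envelope sketch of that bottleneck, in Python. The hand-off time and per-MTA capacity below are made-up numbers for illustration, not measurements, and the function name is hypothetical. It only shows that a serial feeder caps throughput however many MTAs sit behind it.

```python
def delivered_per_second(feed_seconds_per_msg, mta_count, mta_msgs_per_second):
    """Messages per second that actually leave the system.

    A single-threaded feeder hands off one message at a time, so it can
    never exceed 1 / feed_seconds_per_msg, however many MTAs are waiting.
    """
    feeder_rate = 1.0 / feed_seconds_per_msg
    mta_capacity = mta_count * mta_msgs_per_second
    return min(feeder_rate, mta_capacity)

# Assumed numbers: 0.5 s to hand off one message, 50 msgs/s per MTA.
for mtas in (1, 4, 16):
    print(mtas, delivered_per_second(0.5, mtas, 50.0))  # 2.0 every time
```

With these assumed numbers, adding MTAs changes nothing until the feeder itself gets faster or runs in parallel.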

> Spool on the MTA systems is on a battery backed RUPP Silicon disk
> (10Gig units IIRC, not cheap).  Big RAM heavy machines capable of
> running mid thousands of simultaneous queue runners (under QMail
> as happens).

Funny you should mention this. I've been researching some of this myself
recently. There's a problem here -- LAN/WAN saturation. It doesn't matter
how fast your machine is if you saturate your 100baseT (which it looks like
I'm doing on my really-big machine -- we're working to get a quad ethernet
configured into it as we speak). You can build a really fast machine, but if
you can only outflow 12 megabytes a second, all the RAM disk in the universe
won't help -- you still get 12 megabytes a second on 100baseT. And if it
feeds a slower net -- you're only as fast as your slowest piece...
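
The arithmetic behind that ceiling, as a small Python sketch. The 10 KB average message size is an assumption chosen for illustration, the function name is hypothetical, and protocol overhead is ignored, so real numbers will be lower.

```python
def wire_limit_msgs_per_second(link_megabits, avg_message_bytes):
    """Upper bound on messages/second a link can carry, ignoring overhead."""
    bytes_per_second = link_megabits * 1_000_000 / 8
    return bytes_per_second / avg_message_bytes

# 100baseT is 100 megabits/s, i.e. 12.5 megabytes/s before overhead.
# Assumed average message size: 10,000 bytes.
print(wire_limit_msgs_per_second(100, 10_000))  # 1250.0
```

Keep in mind that a list post goes out once per recipient (or per recipient batch), so the message count here is copies on the wire, not posts to the list.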

This is one reason I'm working on a design I'm calling "attack of the killer
smurfs" -- racks of small, cheap, fast boxes that ONLY deliver e-mail, each
with its own CPU, memory, disk and 100baseT or gigabit interface. Because for
what it costs to build a really big, fast muther machine, you can rack up a
dozen small, cheap machines each with its own net interface -- and if you
can hook that up to a set of multiple OC-3's...

But then you have to have a system to get the mail out to the smurfs -- and
standard systems just won't cut that...

> Domain routing based on historical MX profiling partitioned the
> spool base and the resultant spool entries are smarthosted out to
> a pool of delivery servers (more RUPP disks), with the really slow
> MXes being pulled out and dumped on the black hole box (constantly
> overwhelmed and struggling to deliver mail to boxes that were
> mostly not there).

I'm just starting to investigate that. I'm seriously thinking of dedicating
a subset of my smurfs to handling defined "slow" domains. I also run a long
time-between-retries (6 hours), so slow/dead domains don't overly slow
stuff down -- I find that's a reasonable alternative to moving stuff between
queues and having "bog" queues and all of that stuff....
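
One way the "slow domain" split could look, sketched in Python. This is not how Mailman or any particular MTA does it. The function name, the pool labels, the addresses and the slow-domain list are all hypothetical, and a real list of slow domains would have to come from your own delivery history.

```python
from collections import defaultdict

def partition_by_pool(recipients, slow_domains):
    """Split recipient addresses into 'fast' and 'slow' pools by domain."""
    pools = defaultdict(list)
    for addr in recipients:
        # Take everything after the last '@' and compare case-insensitively.
        domain = addr.rsplit("@", 1)[-1].lower()
        pool = "slow" if domain in slow_domains else "fast"
        pools[pool].append(addr)
    return dict(pools)

# Hypothetical data for illustration only.
slow = {"slow.example"}
rcpts = ["a@fast.example", "b@slow.example", "c@Slow.Example"]
print(partition_by_pool(rcpts, slow))
```

Each pool could then be handed to its own set of delivery boxes, so the slow domains only ever tie up the machines set aside for them.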

> Check your RAM usage patterns.  Watch swap.  Add RAM if indicated.

Watch EVERYTHING. Figure out where your bottlenecks are, not where you think
they might be. And IMHO, the most important thing you can do is learn how to
tell at a glance when the system is running well, because that's how you
learn to spot when the system *isn't* running right -- once you know how
it ought to look, you're better at finding what's wrong.

I *always* have windows up on my monitor machine running things like top, se
(if you're running Solaris), and scrolls of the logs. In one window I always
do

    cd ~mailman/logs
    foreach I (*)
        tail -f $I &
    end
    kill %4    # kills the tail on the USENET log, which is too chatty and gets in the way

So I can see at a glance what's going on. And after you watch for a while,
you'll see trends of what's going wrong, and can tell even from across the
room when the logs look 'wrong' (like endless qrunner lock losts, which
indicate qrunner might have wedged).
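
If you'd rather not rely on eyeballs alone, the same "does this look wrong" check can be roughed out in Python. This is only a sketch: the function name, the window and threshold values, and the "lock lost" string are all assumptions, so substitute whatever text your own logs actually print.

```python
def looks_wedged(lines, pattern, window=50, threshold=10):
    """Return True if `pattern` appears at least `threshold` times
    in the last `window` lines."""
    recent = lines[-window:]
    hits = sum(1 for line in recent if pattern in line)
    return hits >= threshold

# Hypothetical log contents; the message text is an assumption.
healthy = ["delivered ok"] * 100
suspect = ["delivered ok"] * 40 + ["qrunner lock lost"] * 12
print(looks_wedged(healthy, "lock lost"))  # False
print(looks_wedged(suspect, "lock lost"))  # True
```

The lines would come from reading the tail of the relevant log file; the thresholds are something you tune once you know what "normal" looks like on your system.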

> You are *NOT* CPU bound at this point.  You may be disk bound if
> your disks are really slow, syslog is on the same spindle as
> spool, and/or you have syslog configured to sync mail log writes.

Minimize your syslog writing -- good point. The critical delays are MTA,
DNS, qrunner single-threading, and then network. On mailman, I've never seen
it overload disk, and it's just not that RAM intense.

> RAM and disk bandwidth are your enemies at this point.

I disagree, actually. You'll run out of network before either, IMHO.





