[Mailman-Developers] Some notes on my stuff...

Chuq Von Rospach chuqui@plaidworks.com
Sun, 22 Jul 2001 22:15:15 -0700


We had discussions a while back on the performance problems I was seeing on
my big mailman machine. I figured it was probably time for an update, so
everyone knows what's going on...

For those that don't remember, my big mailman machine (Sun E250, mailman
2.0.5, sendmail) was maxing out at about 625 messages processed a day -- and
we were seeing peak backlogs of 4 hours between receiving and processing.
Not good.

My testing showed the major processing delay was the speed of uptake by
sendmail, primarily due to DNS delays. And given that you can't turn off DNS
lookups without turning off a lot of the anti-spam stuff, you're kinda
stuck.

So I've been working to move my system from sendmail to postfix. Along the
way, by pure happenstance (you can read that to mean "I was testing, I
screwed up, and it showed me something I wasn't looking for and didn't
expect to find"), I found out that there was a second, more subtle and
infernal problem that was making the main problem even worse: disk I/O.

It turns out I was saturating a disk spindle, and it wasn't obvious that I
was doing it. Once I found it, though, it was obvious what was happening --
sendmail was, basically, stuttering, and with the disk I/O causing delays,
every minor delay by DNS was being elongated, which was causing much of the
performance problem.
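
A minimal sketch of the kind of check that would have exposed this earlier, assuming you have already collected per-device busy percentages (for example the busy column from periodic iostat runs). The function name, the device names, the sample values, and both thresholds are hypothetical, chosen only for illustration; this is not the tooling actually used on the machine described here.

```python
from collections import defaultdict


def saturated_devices(samples, busy_threshold=80.0, min_fraction=0.5):
    """Return device names that look saturated.

    samples: iterable of (device_name, percent_busy) pairs, one per
    measurement interval. A device is reported when its busy percentage
    is at or above busy_threshold in at least min_fraction of its samples.
    Both cutoffs are illustrative assumptions, not recommended values.
    """
    counts = defaultdict(lambda: [0, 0])  # device -> [hot samples, total samples]
    for device, busy in samples:
        entry = counts[device]
        entry[1] += 1
        if busy >= busy_threshold:
            entry[0] += 1
    return sorted(
        device
        for device, (hot, total) in counts.items()
        if hot / total >= min_fraction
    )


# Hypothetical samples: one hot spindle, one nearly idle spindle.
samples = [
    ("sd0", 95.0), ("sd1", 12.0),
    ("sd0", 88.0), ("sd1", 9.0),
    ("sd0", 40.0), ("sd1", 15.0),
]
print(saturated_devices(samples))
```

Looking at each device separately is the point: a machine-wide average of the numbers above would look unremarkable, while one spindle is busy most of the time and another is nearly idle.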

I was able to fix most of the disk I/O problem by splitting off a big chunk
to an underused spindle, and saw performance improve markedly -- it's now
processing 700+ messages a day (the peak day was 810), and my delay to
delivery is normally 5 minutes or less. I've seen a few times where that
delay's grown to a whopping half an hour -- but that, at least right now, is
fine.

Because of this, I've slowed down on the move to postfix, not because I
don't plan on doing it, but because other things have taken priority for now
and I want to do some testing before I make the change (as I like to tell my
management, swapping MTAs is like replacing the transmission in a bus just
before driving the kids to summer camp, and I don't particularly want to end
up on the side of the road because I was in too much of a hurry...). My
capacity magically went up at least 30% by fixing the disk I/O problems.
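
As a rough check on that figure, here is the arithmetic using only the numbers quoted above (about 625 a day before the fix, 700+ on a typical day and 810 on the peak day afterward). The helper name is made up for illustration, and the before and after figures are not measured under identical load, so treat the result as approximate.

```python
def percent_gain(before, after):
    """Percentage increase going from `before` to `after`."""
    return (after - before) / before * 100.0


baseline = 625  # daily volume before the disk I/O fix
typical = 700   # lower bound of the "700+" typical day after the fix
peak = 810      # peak day after the fix

print(round(percent_gain(baseline, typical), 1))
print(round(percent_gain(baseline, peak), 1))
```

The peak day works out to roughly a 30% gain over the old ceiling, and the typical day to a bit over 10%.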

Which leads to a few truisms. First, when you're trying to find a problem,
assume nothing. And once you FIND the problem, don't assume it's THE
problem, it might instead be A problem -- in my case, there was a second,
underlying problem that the switch to postfix would have reduced but not
solved, and it's unclear by how much. And, of course, check everything. One
thing I didn't look at until after I stumbled into this was spindle usage,
and where I assumed things were okay, it turned out they weren't.

And it's probably safe to say you should ALWAYS assume that disk I/O is a
problem, and rule it out before you start finding other problems... Because
it's something that will make all of the other problems worse, but isn't
necessarily easy to find unless you look for it specifically.

We're now talking about upgrading all of the email machines to RAID 1+0 down
the road, just to build a system that maximizes the disk performance, since
it's now clear I've been underestimating its impact (even though I thought
I wasn't...)



-- 
Chuq Von Rospach, Internet Gnome <http://www.chuqui.com>
[<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>]
Yes, yes, I've finally finished my home page. Lucky you.

The first rule of holes: If you are in one, stop digging.