[Mailman-Developers] Mailman 2.0 usage notes...

chuq von rospach chuqui@plaidworks.com
Tue, 10 Apr 2001 13:47:24 -0700


Been whapping at my big mailman machine the last week, because it's been 
slowly falling behind and never quite catching up. Unfortunately, it 
seems like I've simply hit capacity for now, so I'm looking for ways to 
extend that or at least minimize the damage until 2.1's multi-threaded 
queueing comes online and I can use it (my big problem is the 
single-threading....)

One thing that would help would be if the Sendmail module got 
productionalized so it could be used instead of SMTPDirect, because then 
you could use the -deliverymode=defer option, which you can't on 
SMTPDirect because it disables some spamchecking.

But going elbow-deep into the server for a few days under a constant 
grinding load has brought forward a few things I thought I'd pass on...

First, there's a problem with the way queues are processed. qrunner 
uses:

	for file in os.listdir(xxxx):

to read the queue. The order is undefined, but in practice it's 
basically blatting the entries out in the order they're stored in the 
directory. If you're not overly busy, that's fine. But if you start 
hitting the point where you're backing up, it means you process "N" 
slots into the directory, then qrunner exits and starts again, and it 
then re-covers the same directory slots. Things that land past those 
slots simply NEVER get processed, unless the system quiets down enough 
to let qrunner catch up. Qrunner *really* needs to process the queue 
FIFO.

Unfortunately, teaching qrunner to go FIFO is a bit complicated. You'd 
have to pull all of the filenames out, stat them all, and then sort on 
the timestamps. There's a much easier solution, though --

In Mailman/Message.py, where the filename is created, Mailman uses 
time.time() and some other values to create a filename, which is then 
converted to hex. The idea is to create a unique filename. But in fact, 
time.time() should be unique (to be paranoid, one could grab it and then 
check to see if the filename exists and loop), and if you stored the 
queue files as "time.time()".msg/.db, then qrunner could sort the queue 
trivially, guaranteeing that the oldest messages are always processed in 
each queue run.
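
To make the idea concrete, here is a minimal sketch in present-day 
Python, not a patch against Mailman. The function names, the 
microsecond resolution, the zero-padded width and the single ".msg" 
extension are all my assumptions for illustration; the real code in 
Mailman/Message.py is not shown here. It does the "grab it, check, and 
loop" step by creating the file exclusively, so two writers can't claim 
the same name.

```python
import os
import time


def create_queue_file(qdir, ext=".msg"):
    """Create an empty, uniquely named queue file and return its base name.

    Hypothetical helper: the name is the current time in microseconds,
    zero-padded to a fixed width so that a plain string sort of the
    names is also a chronological sort.
    """
    while True:
        base = "%020d" % int(time.time() * 1_000_000)
        path = os.path.join(qdir, base + ext)
        try:
            # O_EXCL makes creation fail if the name already exists,
            # which is the paranoid "check and loop" done atomically.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except FileExistsError:
            continue  # name taken; the clock will have moved on by the retry
        os.close(fd)
        return base


def queue_entries_oldest_first(qdir, ext=".msg"):
    """Return the base names of queued entries in qdir, oldest first."""
    bases = [name[:-len(ext)] for name in os.listdir(qdir)
             if name.endswith(ext)]
    bases.sort()  # fixed-width numeric names: string order == time order
    return bases
```

With names like these, qrunner would only need to sort the os.listdir() 
result before looping over it; no stat calls are required. The ordering 
is only as good as the system clock, so a clock that steps backwards 
could still put a newer message ahead of an older one.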

The current behavior creates some interesting race conditions, where an 
item is added to the queue and never comes out -- which causes people to 
think it's lost and repost it, adding to the queue clogging, which slows 
stuff, which... until Saturday, when they all go home, the system slows 
down, and qrunner catches up and posts three day old messages.... And 
since the system seems to be working just fine, tracking it down is 
fun...

This also led to finding a problem in the bounce processing area. The 
bouncer works pretty well, but it has one flaw for which I don't have an 
easy answer. If I'm subscribed as "chuq@plaidworks.com", but for some 
reason the bounce comes back as "chuq@mail.plaidworks.com" (or vice 
versa, or if I'm forwarding mail in some other name), the bouncer will 
catch the bounce and try to process it, not find me, and log it as a 
"user not subscribed". Unless the admin is somehow post-processing the 
bounce logs, though, that bounce is never REALLY handled, so it bounces 
indefinitely, and the admin never knows. This also over time encourages 
queue clogging and wastes bandwidth and CPU and all of that -- and 
worse, list admins and site admins probably think everything is fine 
because the bouncing system is working and these "not a member" bounces 
are never reported anywhere.

On the other, other hand, you probably don't want to just blat all these 
at admins, since they'll tune them out. But some kind of nightly report 
of some sort is the tradeoff I'd make, I think, so admins could see 
continual bounce problems that need to be manually investigated. And I 
strongly recommend all site admins watch the bounce logs and look for 
these "missing" bounces, so they can be manually tracked. I found on my 
busy site this made a HUGE difference in my queue backlogs, too; these 
things were silently contributing a significant amount of traffic to the 
queue system and exacerbating my capacity issues.
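
A nightly report along those lines could be as small as the sketch 
below. Everything specific in it is an assumption: the log line layout 
matched by NOT_MEMBER is made up for illustration and is NOT Mailman's 
actual bounce log format, so the pattern would have to be adapted to 
whatever your logs really contain. The function name is mine as well.

```python
import re
from collections import Counter

# ASSUMPTION: hypothetical log line layout, e.g.
#   "mylist: someone@example.com - not a member"
# Adjust this pattern to the lines your bounce log actually contains.
NOT_MEMBER = re.compile(r"^(?P<list>\S+): (?P<addr>\S+) - not a member$")


def summarize_unmatched_bounces(lines, pattern=NOT_MEMBER):
    """Count "not a member" bounces per (list, address) pair.

    Returns a list of ((list_name, address), count) tuples, most
    frequent first, suitable for mailing to the site admin once a day.
    """
    counts = Counter()
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            key = (match.group("list"), match.group("addr").lower())
            counts[key] += 1
    return counts.most_common()
```

Run once a day over that day's log, this collapses hundreds of repeated 
bounces into one line per address, which is about the volume an admin 
will still read. Addresses that show up day after day are the ones to 
track down by hand.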

I really think the qrunner issue needs to be dealt with; it only shows 
up on fairly busy sites, but it's a definite bug for folks like me. The 
bouncer issue is less nasty, but in a "good citizens clean up their 
trash" spirit, I think we need to at least make sure list/site admins 
are aware of these bounces, unless someone can figure out a way to 
automate fixing them (this is a place where VERP type things could help, 
but I'm not going there, honest... giggle).

chuq



--
Chuq Von Rospach, Internet Gnome <http://www.chuqui.com>
[<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>]
Yes, yes, I've finally finished my home page. Lucky you.

The first rule of holes: If you are in one, stop digging.