>>>>> "MM" == Marc MERLIN <marc_news(a)valinux.com> writes:
MM> [I am not Ccing mailman-developers as this is not encouraged,
MM> but if someone on both lists thinks it should be forwarded
MM> there, please feel free]
I'm going to answer some of the other questions as best I can, and
then I propose to move this discussion to mailman-developers so we can
design a proper fix over there.
MM> The problem is due to qrunner being single threaded by default
MM> and having a global lock. Because some mailing lists have
MM> subscribers in domains where DNS is slow and unreliable, the
MM> MTA will hang on those rcpt to until DNS resolves or timeouts,
MM> and qrunner won't be done in time. After that, it's all
MM> downhill from there, more mail queues up, qrunner falls even
MM> further behind, etc, etc...
So one of the problems is that the handoff between Mailman and the MTA
is synchronous with some aspect of the MTA's delivery to the remote
site, namely DNS lookup.  One of the first things you need to do is
break this synchrony, either by improving the DNS lookup on the MTA
side or putting the MTA in asynchronous mode for local message
acceptance. Basically you want Mailman to just say "here, don't
process this yet, just drop it in your outgoing queue and deal with it
later."  This might be related to DSN (delivery status notification);
there are two DSN RFCs (don't have the numbers handy right now): one
describes the MIME bounce format and the other describes ESMTP
extensions for synchronous notification of delivery failures.  You
/don't/ want that!
From the followups, it sounds like Exim can be configured to take
local delivery asynchronously, and I believe that that is how Postfix
works by default. Dunno about Qmail or Sendmail, but I have to
believe it's possible to put them in those modes.
Now, let me sketch out how I think Mailman ought to work for 2.1; it
would not be too hard to whip something up with the 2.0 architecture
either (ob plug: we might be able to arrange some consulting gigs with
Digital Creations if necessary).
First, I'd write a long-running process based around asynchat/asyncore
that was essentially our own bulk mailer.  The async* modules are
standard Python modules that make select-based, high-performance
servers possible.  They aren't multithreaded, and do not need to be,
because these servers are primarily i/o bound.  When i/o blocks on one
channel, another one picks up and works for a while.
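
To make the select-based idea concrete, here is a minimal sketch of one
loop servicing several channels without threads.  It uses the standard
`selectors' module rather than asyncore/asynchat (those were removed from
the standard library in Python 3.12); the function name and the
label-to-socket mapping are assumptions made up for illustration, not
anything in Mailman.

```python
import selectors
import socket


def drain_channels(channels):
    """Read from whichever channel is ready until every peer closes.

    `channels` maps a label to a connected socket (hypothetical layout).
    Returns a dict of label -> bytes received on that channel.
    """
    sel = selectors.DefaultSelector()
    received = {label: b"" for label in channels}
    for label, sock in channels.items():
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, data=label)

    open_count = len(channels)
    while open_count:
        # select() waits only until at least one channel is readable, so a
        # slow or idle channel never holds up the ones that have data.
        for key, _events in sel.select():
            chunk = key.fileobj.recv(4096)
            if chunk:
                received[key.data] += chunk
            else:
                # An empty read means the peer closed its end.
                sel.unregister(key.fileobj)
                open_count -= 1
    sel.close()
    return received
```

With two `socket.socketpair()' pairs as stand-ins for remote peers, the
loop collects data from both regardless of which peer writes first.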
Let's call this new process `bulkmail'.  Bulkmail would have
one (probably) Unix socket open to take new outgoing messages from
qrunner. It'd probably write them to disk as a backup so failures
don't drop messages. I'm thinking it would then sort recipients based
on domains, and then it would start resolving MX records, caching the
results. There'd be bins for each MX containing pointers to the
messages that need to be delivered to that MX. As more messages came
in for that MX, they'd be dropped at the end of the bin.
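
A rough sketch of the per-MX bins and the lookup cache follows.  The
class name, the `resolve_mx' callable, and the (message_id, recipients)
tuples are all hypothetical; a real resolver would return several MX
hosts with preferences, which this sketch collapses to a single host per
domain.

```python
from collections import defaultdict


class MXBins:
    """Group outgoing messages into per-MX bins, caching MX lookups."""

    def __init__(self, resolve_mx):
        # resolve_mx is a hypothetical callable: domain -> MX host name.
        self._resolve_mx = resolve_mx
        self._mx_cache = {}
        # MX host -> list of (message_id, [recipients]) in arrival order.
        self.bins = defaultdict(list)

    def mx_for(self, domain):
        """Return the MX host for a domain, resolving it at most once."""
        domain = domain.lower()
        if domain not in self._mx_cache:
            self._mx_cache[domain] = self._resolve_mx(domain)
        return self._mx_cache[domain]

    def add_message(self, message_id, recipients):
        """Sort recipients by domain and append to the end of each bin."""
        by_mx = defaultdict(list)
        for addr in recipients:
            domain = addr.rsplit("@", 1)[1]
            by_mx[self.mx_for(domain)].append(addr)
        for mx, addrs in by_mx.items():
            self.bins[mx].append((message_id, addrs))
```

Storing a message id rather than the message body keeps each bin a list
of pointers, as described above; the body would live in the on-disk
backup.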
Once a connection to the MX is established, bulkmail would then just
start delivering messages to it until the bin was emptied. Any i/o
blocks in any of the processes will allow async* to switch to a
different delivery channel. We may need to do some explicit channel
management to make sure some are not starved.
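
One possible form of that explicit channel management is a round-robin
pass over the bins with a cap on how many messages each MX gets per
turn.  This is only a sketch under assumed names (`round_robin', `burst');
it schedules the order of deliveries and does no network i/o itself.

```python
from collections import deque


def round_robin(bins, burst=2):
    """Yield (mx, message) pairs, at most `burst` per MX per turn.

    `bins` maps an MX host to a list of pending messages (hypothetical
    layout).  A large bin cannot starve a small one, because after its
    burst it goes to the back of the line.
    """
    ready = deque((mx, deque(msgs)) for mx, msgs in bins.items() if msgs)
    while ready:
        mx, queue = ready.popleft()
        for _ in range(min(burst, len(queue))):
            yield mx, queue.popleft()
        if queue:
            # Still has work: requeue behind the other MX hosts.
            ready.append((mx, queue))
```

The `burst' value trades fairness against the cost of switching between
connections; the right setting would have to be found by measurement.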
We'd have to have a watchdog to make sure bulkmail is running, and I'm
sure there are other issues to work out, but I've gotta run. I think
this will work better than the current SMTPDirect threading stuff.