[Mailman-Developers] Huge lists

Chuq Von Rospach chuqui@plaidworks.com
Thu, 25 May 2000 00:29:48 -0700


At 12:05 AM -0700 5/25/2000, J C Lawrence wrote:
>It's tough to
>imagine a situation where my time and effort in replacing them (as a
>solo effort) would actually be worth it versus throwing hardware
>at the problem or chatting up Wietse & co.

Throwing hardware at a problem isn't always possible. The place
where rolling your own internal MTA starts becoming useful is when
the list is big enough that the disk I/O involving the MTA becomes
the significant limiter. With sendmail 8.9.x, that's fairly
easy to run into. With sendmail 8.10, it seems to be better, and the
multiple queue stuff solves a multitude of problems involving huge
directory structures.

VERP exacerbates the problem, since the # of batches sent to the MTA
equals the # of addresses, which explodes the number of control
files, which... So at some point, it makes sense to deliver direct to
the recipient rather than build batches into the MTA, completely
avoid the disk I/O, and deliver right out of the database to the
receiving SMTP server. You could strongly parallelize the delivery
setup because you'd do away with all of the MTA overhead, and do all
sorts of fun things, like prioritize your delivery sorting and the
like.
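
To make the VERP batch explosion concrete, here is a rough Python sketch. The function names and the "local+user=domain" encoding are my assumptions for illustration (it is one common VERP convention, not necessarily what any given MLM uses). It only shows why the batch count goes from N/batch_size to N once each recipient needs its own envelope sender.

```python
def verp_sender(bounce_addr, recipient):
    """Encode the recipient into the envelope sender (one common VERP form)."""
    local, domain = bounce_addr.split("@", 1)
    rcpt_local, rcpt_domain = recipient.rsplit("@", 1)
    return f"{local}+{rcpt_local}={rcpt_domain}@{domain}"


def plain_batches(recipients, batch_size):
    """Without VERP: one envelope sender, so recipients can share a batch."""
    return [recipients[i:i + batch_size]
            for i in range(0, len(recipients), batch_size)]


def verp_batches(bounce_addr, recipients):
    """With VERP: every recipient gets a unique sender, so one batch each."""
    return [(verp_sender(bounce_addr, r), [r]) for r in recipients]
```

For 1000 subscribers and a batch size of 100, the first gives the MTA 10 batches and the second gives it 1000, each with its own control file.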

Which, if you're trying to deliver 5,000,000 emails a day and do so
within a tight delivery window, gets important -- and for the
other 99.5% of the universe, just doesn't matter that much (snork).

>   I've written list
>servers and mini-MTAs before.  There's a fair bit of hidden
>complexity and brain hurt in there I don't mind avoiding.

Yes, that's very true. Just dealing with MX gets gnarly.

>True.  Were Mailman asynchronous, a pattern as below would seem
>useful:
>
>   There is never more than a single "queued message handler" process
>   (maybe multi-threaded, or not).  That process guarantees not to
>   feed messages to the MTA any faster than XXX messages per
>   second/minute, and to stop such feeding were system load to rise
>   above ZZZ.  The single instance rule prevents multiple handler
>   processes for multiple mailing lists maxing out the MTA as they
>   all dump simultaneously.

That's basically how my big machine has evolved -- I'm using three
queues: one to generate delivery batches (and requeue them into
queue2), a 2nd queue to parallelize bulk_mailers into the MTA, and
a third queue just for smartbounce and non-delivery batches, to keep
them out of the way... It's nice, because my setup batch can generate
a bunch of batches, and it's up to the queue system to make sure only
"N" of them are running at any time. Any batch that hits slow
domains doesn't back up huge numbers of addresses; the waiting
batches slip into other slots. Oh, queuing theory is such fun. I got
into computers to AVOID math...
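
The "only N running at any time" part can be sketched in a few lines of Python. This is not my actual setup, just an illustration under assumptions: run_batches, make_tracked_delivery and the sleep-based fake delivery are hypothetical names, and a thread pool stands in for the real queue system.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_batches(batches, deliver, max_running):
    """Call deliver(batch) for every batch, at most max_running at once.

    A slow batch only occupies its own slot; waiting batches take
    whichever slot frees up first.
    """
    with ThreadPoolExecutor(max_workers=max_running) as pool:
        return list(pool.map(deliver, batches))


def make_tracked_delivery(delay_for):
    """Build a fake deliver() that records peak concurrency (demo only)."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def deliver(batch):
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(delay_for(batch))  # stand-in for talking to the MTA
        with lock:
            state["running"] -= 1
        return len(batch)

    return deliver, state
```

In a real system deliver() would hand the batch to a bulk mailer; here it just sleeps so you can watch the slot limit hold.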

>   The problem of multiple list servers
>   (boxes) dumping simultaneously to a remote MTA is properly, I
>   believe, outside of Mailman's purview.
>
>I don't see a value in trying to monitor MTA queue size.  Too MTA
>specific.

See the disk I/O issues above. In a perfect world, the MTA would
throttle itself to avoid overload conditions. In practice, you
have to be careful both to tune the MTA to maximize output and to tune
the MLM to avoid blowing it out. If you have a burst that stuffs 2500
batches into a sendmail queue all at once, then sendmail has that big
directory problem in a big way, and your system goes to hell.

Sendmail 8.10 goes a long way to minimizing this, but you can still
force your MTA to thrash, and when you do, everything gets really
unhappy. So perhaps you don't need to have the MLM monitor the MTA
constantly and throttle itself, though that's actually not a bad thing,
IMHO, if it can be done reasonably. On the other hand, I wouldn't
make it a big focus, since it'd be a LOT easier to write some docs on
how to tune the system and what to watch out for, and let the admin
do the tuning. Once the tuning is done, it probably won't require a
lot of watching...
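
For what it's worth, the MTA-agnostic version of the throttle discussed above (cap the feed rate, stop feeding while system load is high) is small. A hedged Python sketch follows; FeedThrottle and its parameters are made-up names, and the default load check assumes a Unix box where os.getloadavg() is available.

```python
import os
import time


class FeedThrottle:
    """Pace hand-offs to the MTA and back off while system load is high."""

    def __init__(self, max_per_second, max_load,
                 clock=time.monotonic, sleep=time.sleep, load=None):
        self.min_interval = 1.0 / max_per_second
        self.max_load = max_load
        self.clock = clock
        self.sleep = sleep
        # Default: 1-minute load average (Unix only, an assumption here).
        self.load = load or (lambda: os.getloadavg()[0])
        self.next_ok = None  # earliest time the next hand-off may happen

    def wait_for_slot(self, backoff=5.0):
        """Block until it is OK to feed one more message or batch."""
        while self.load() > self.max_load:
            self.sleep(backoff)
        now = self.clock()
        if self.next_ok is not None and now < self.next_ok:
            self.sleep(self.next_ok - now)
            now = self.next_ok
        self.next_ok = now + self.min_interval
```

The single handler process would call wait_for_slot() before each hand-off. The numbers (rate, load ceiling, backoff) are exactly the tuning knobs the admin would set once and then mostly leave alone.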


>> Well, this is probably preaching to the choir, but I've gotten
>> quite convinced that you isolate every piece you can from every
>> other piece, and document the interfaces. that makes it quite easy
>> to swap out a new piece without affecting the rest of the system
>
>This is often called, "programming by contract".  It's a Good Thing.

Heh. It's also called "breaking a huge project down into tiny pieces 
so your customers don't worry nearly as much about deadlines"...

>One of my list members has been advocating WebCrossing.  What do you
>think of it?

Not appropriate for this list. Let's talk offline. I'm designing it 
out of my systems in favor of other things, but the reasons are 
complex -- and I've recommended it INTO at least one major 
development project at the same time. So I guess the answer is "it 
depends, but I'm not going to be using it myself..."

>> you could do something really nice with PHP and MySQL, too, and


>Yeah, I've thought about that but I really just don't see enough
>advantage to justify the time it would take to get something better
>than I have now.

I wonder how much of this could be driven out of something like 
Midgard? But loading your entire archives into a database gives you 
the ability to do all sorts of interesting linking and searching and 
stuff, and "all" you'd need is some email->XML converter, and then...

Oh, man. We need to at least pretend to be on topic for this list, 
but I need a white board and a pen... (scribbly scribble...)


-- 
Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com)
Apple Mail List Gnome (mailto:chuq@apple.com)

And they sit at the bar and put bread in my jar
and say 'Man, what are you doing here?'"