
On Mon, 11 Dec 2000 19:27:06 -0800 Chuq Von Rospach <chuqui@plaidworks.com> wrote:
At 7:16 PM -0800 12/11/00, J C Lawrence wrote:
Not exactly. My architecture has the ability to create messages internally that are then passed back thru the processing system.
oh, yeah. duh.
<snicker>
I kinda like the way you think.
that should scare you...
According to my wife you should be terrified about now.
FWLIW I'm working on the following leading notes:
--<cut>--
Assumption: The localhost is Unix-like.
ObTheme: All config files should be human readable unless those files are dynamically created and contain data which will be easily and automatically recreated.
ObTheme: Unless a data set is inherently private to Mailman, Mailman will not mandate a storage format or location for that data set, and will allow that data set to be the result of a locally defined, arbitrary, replaceable process (a sketch of what that could look like follows below).
ObTheme: Every single program or process may be replaced with something else, as long as that other thing accepts the same inputs, generates outputs within spec, and performs a somewhat similar function.
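For example, membership generation could be delegated to whatever the local site provides. A minimal sketch, where the program path and its one-address-per-line output format are assumptions for illustration, not a committed interface:

    import subprocess

    def load_members(listname, command="/usr/local/bin/gen-members"):
        # Ask an arbitrary, locally supplied program for the membership
        # of a list; any replacement honouring the same input/output
        # contract will do.  Path and output format are hypothetical.
        result = subprocess.run([command, listname],
                                capture_output=True, text=True, check=True)
        return [line.strip() for line in result.stdout.splitlines()
                if line.strip()]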
There are basically three approaches to scalability in use for this sort of application:
- Using multiple simultaneous processes/threads to parallelise a given task.
- Using multiple systems running in parallel to parallelise a given task.
- Using multiple systems, each dedicated to some portion(s) or sub-set of the overall task (all of which might be working in parallel on the entire problem (lock contention! failure modes!)).
The intent is to be able to transparently support all three models on a per-list or per-installation basis, or some arbitrary mix of the two (some sections of the problem for some lists handled by dedicated systems, other sections of the problem for all the other lists handled either by a different pool of systems or by processes).
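As a very rough sketch of how such a mix might be expressed (every name here is hypothetical, nothing is a committed interface):

    # Hypothetical routing table: one list gets a dedicated pool of
    # delivery hosts, everything else shares a pool of local worker
    # processes.
    ROUTING = {
        "big-announce": {"delivery_hosts": ["out1.example.com",
                                            "out2.example.com"]},
    }
    DEFAULT = {"worker_processes": 4}

    def route_for(listname):
        return ROUTING.get(listname, DEFAULT)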
Observation: MLMs are primarily IO bound devices, and are specifically IO bound on output. Internal processing on mail servers, even given crypto authentication and expensive membership generation processes (eg heavy SQL DB joins etc), is an order of magnitude smaller problem than just getting the outbound mail off the system.
Consider a mid-size list of 1K members. It is a busy list and receives 500 messages a day, each of which is exploded to all 1K members:
-- That's 500 authentication cycles per day.
-- That's 500 membership list generations.
-- That's 500,000 outbound messages.
-- That's 500,000/MAX_RCPT_TOS SMTP transactions.
Even given a MAX_RCPT_TOS of 500 (a bit large in my mind) that's 1K high latency multi-process SMTP transactions versus 500 crypts or SQL queries.
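Restating that arithmetic, with the numbers taken straight from the example above:

    members = 1000             # list size
    posts_per_day = 500        # inbound messages exploded to the list
    MAX_RCPT_TOS = 500         # recipients per SMTP transaction

    auth_cycles = posts_per_day                           # 500/day
    membership_generations = posts_per_day                # 500/day
    outbound_messages = members * posts_per_day           # 500,000/day
    smtp_transactions = outbound_messages // MAX_RCPT_TOS # 1,000/day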
Observation: In the realm of MLM installations there are two end points to the scalability problem:
- Sites with lists with very large numbers of members
- Sites with large numbers of lists which have few members.
Sites with large numbers of lists with large numbers of members (and presumably large numbers of messages per list) are the pessimal case, and that is not one Mailman is currently targeting to solve.
The first case is outbound IO bound. The second case may be local storage IO bound, as it spends significant time walking local filesystems during queue processing while the outbound IO rates are comparatively small (and unbursty). Possibly.
SourceForge falls into the second case.
Observation: Traffic bursts are bad. Minimally the MLM should attempt to smooth out delivery rates to a given MTA to be no higher than N messages/time. This doesn't mean the MLM doesn't deliver mail quickly, just that in the case of a mail burst (suddenly 20 million messages sitting in the outbound queue) the MLM will give the MTA the opportunity to try to react intelligently, rather than overwhelming it near instantly with all 20M messages dumped into the MTA spool over 30 seconds while the spool filesystem gags.
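A minimal sketch of that smoothing, assuming a deliver() callable and a tunable rate (both of my own invention for the example):

    import time

    def paced_drain(queue_entries, deliver, max_per_second=50):
        # Hand entries to the MTA no faster than max_per_second, so a
        # sudden multi-million message burst trickles into the spool
        # instead of landing on it all at once.
        interval = 1.0 / max_per_second
        for entry in queue_entries:
            started = time.monotonic()
            deliver(entry)
            spent = time.monotonic() - started
            if spent < interval:
                time.sleep(interval - spent)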
There are five basic transition points for a message passing thru a mailing list server:
- Receipt of message by local MTA
- Receipt by list server
- Approval/editing/moderation
- Processing of message and emission of any resultant message(s)
- Delivery of message to MTA for final delivery.
#1 is significant only because we can rely on the MTA to distinguish between valid list-related addresses and non-list addresses.
#2 is just that. The message is received by the MLM and put somewhere where it will later be processed. The intent is that this is a lightweight LDA process that does nothing but write queue files. The MLM's business is to make life as easy as possible on the MTA. This is part of that.
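A sketch of such an LDA, with the spool directory an assumption of mine rather than an actual Mailman path:

    import os, socket, sys, time

    QUEUE_DIR = "/var/spool/mlm/in"   # assumed location

    def main():
        # Collision-resistant queue file name: timestamp + pid + host.
        name = "%d.%d.%s" % (int(time.time()), os.getpid(),
                             socket.gethostname())
        tmp = os.path.join(QUEUE_DIR, "." + name)
        with open(tmp, "wb") as f:
            f.write(sys.stdin.buffer.read())   # raw message from the MTA
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: queue readers never see a half-written file.
        os.rename(tmp, os.path.join(QUEUE_DIR, name))

    if __name__ == "__main__":
        main()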
#3 Mainly occurs for moderation, and includes editing, approval, authentication, and any other requisite steps. The general purpose of this step is to determine what (if any) subsequent processing there will be of this message.
#4 Any requisite processing on the message occurs, and any messages generated by that processing are placed in the outbound queue.
#5 An equivalent to the current queue runner process empties the queue by creating SMTP transactions for the entries in the queue.
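A sketch of such a queue runner; the entry layout (sender, recipients, raw message text) is an assumption for illustration:

    import smtplib

    def run_queue(entries, relay_host="localhost", max_rcpt_tos=500):
        # One SMTP connection, recipients batched into chunks of
        # max_rcpt_tos per transaction, as in the arithmetic above.
        with smtplib.SMTP(relay_host) as smtp:
            for sender, recipients, message in entries:
                for i in range(0, len(recipients), max_rcpt_tos):
                    smtp.sendmail(sender,
                                  recipients[i:i + max_rcpt_tos],
                                  message)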
The basic view I'm taking of the list server is that it is a staged sequence of processes, each invoked distinctly, orchestrated in the background by cron.
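For instance, a stage could be driven from a crontab entry and guard itself against overlapping runs; the stage registry and lock directory here are made up for the example:

    import fcntl, sys

    STAGES = {}   # stage name -> callable doing one pass, then returning

    def run_once(stage_name, lock_dir="/var/run/mlm"):
        # Invoked from cron, e.g.:  * * * * * run-stage moderate
        lock = open("%s/%s.lock" % (lock_dir, stage_name), "w")
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return        # previous cron invocation is still running
        STAGES[stage_name]()

    if __name__ == "__main__":
        run_once(sys.argv[1])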
Note: Bounce processing and request processing are not detailed at this point, as their rate of occurrence outside of DoS attacks is comparatively low and they are far cheaper than list broadcasts in general.
List processing is a sequence of accepting a message, performing various operations on it which cause state changes to the message and the list processing system, and optionally emitting some number of messages at the end.
As such this lends itself to process queues and process pipes.
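In the process-pipe form a stage can be a plain Unix filter: message in on stdin, (possibly modified) message out on stdout, exit status deciding whether it continues. The header added here is purely illustrative:

    import sys
    from email import message_from_binary_file
    from email.generator import BytesGenerator

    def main():
        msg = message_from_binary_file(sys.stdin.buffer)
        msg["X-Example-Stage"] = "passed"   # illustrative state change
        BytesGenerator(sys.stdout.buffer).flatten(msg)
        sys.exit(0)   # nonzero would drop the message from the pipe

    if __name__ == "__main__":
        main()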
We don't want an over-arching API, or the attempt to solve the entire problem with either one hammer, or one sort of hammer. The intent is to build something that the end user/SysAdm can adapt to his local installation without either stretching or breaking the model, and without needing to build an installation which is necessarily structurally very different from either the very lightweight single-machine small-list system, or the larger EGroups/Topica equivalent.
By using process queues based on canonical names in known filesystem locations, and pre-defined data exchange formats between processes, we can make the processes themselves arbitrary black boxes so long as they accept the appropriate inputs and generate the expected output.
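A sketch of the queue form, with the canonical layout and the handler entirely invented for illustration:

    import os

    SPOOL = "/var/spool/mlm"   # hypothetical canonical location

    def process_queue(stage, next_stage, handler):
        # Any black box that reads conforming files from SPOOL/<stage>
        # and writes conforming files to SPOOL/<next_stage> could
        # replace this loop wholesale.
        in_dir = os.path.join(SPOOL, stage)
        out_dir = os.path.join(SPOOL, next_stage)
        for name in sorted(os.listdir(in_dir)):
            if name.startswith("."):
                continue                    # skip half-written files
            with open(os.path.join(in_dir, name), "rb") as f:
                data = handler(f.read())    # arbitrary per-stage work
            tmp = os.path.join(out_dir, "." + name)
            with open(tmp, "wb") as f:
                f.write(data)
            os.rename(tmp, os.path.join(out_dir, name))
            os.unlink(os.path.join(in_dir, name))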
--<cut>--
--
J C Lawrence                                      claw@kanga.nu
---------(*)                      : http://www.kanga.nu/~claw/
--=| A man is as sane as he is dangerous to his environment |=--