Re: [Mailman-Developers] (no subject)
On Mon, 11 Dec 2000 19:27:06 -0800 Chuq Von Rospach <chuqui@plaidworks.com> wrote:
At 7:16 PM -0800 12/11/00, J C Lawrence wrote:
Not exactly. My architecture has the ability to create messages internally that are then passed back thru the processing system.
oh, yeah. duh.
<snicker>
I kinda like the way you think.
that should scare you...
According to my wife you should be terrified about now.
FWLIW I'm working on the following leading notes:
--<cut>--
Assumption: The localhost is Unix-like.
ObTheme: All config files should be human readable unless those files are dynamically created and contain data which will be easily and automatically recreated.
ObTheme: Unless a data set is inherently private to Mailman, Mailman will not mandate a storage format or location for that data set, and will allow that data set to be the result of a locally defined, arbitrary, replaceable process.
ObTheme: Every single program or process may be replaced with something else, as long as that other thing accepts the same inputs, generates outputs within spec, and performs a somewhat similar function.
There are basically three approaches to scalability in use for this sort of application:
Using multiple simultaneous processes/threads to parallelise a given task.
Using multiple systems running parallel to parallelise a given task.
Using multiple systems, each one dedicated to some portion(s) or sub-set of the overall task (might be all working in parallel on the entire problem (lock contention! failure modes!)).
The intent is to be able to transparently support all three models on a per-list basis or per-installation basis or some arbitrary mix of the two (some sections of the problem for some lists handled by dedicated systems, other sections of the problem for all the other lists handled by a different pool of systems or processes).
Observation: MLMs are primarily IO bound devices, and are specifically IO bound on output. Internal processing on mail servers, even given crypto authentication and expensive membership generation processes (eg heavy SQL DB joins etc), is an order of magnitude smaller problem than just getting the outbound mail off the system.
Consider a mid-size list of 1K members. It is a busy list and receives 500 messages a day, each of which is exploded to all 1K members:
-- That's 500 authentication cycles per day.
-- That's 500 membership list generations.
-- That's 500,000 outbound messages.
-- That's 500,000/MAX_RCPT_TOS SMTP transactions.
Even given a MAX_RCPT_TOS of 500 (a bit large in my mind) that's 1K high latency multi-process SMTP transactions versus 500 crypts or SQL queries.
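A back-of-the-envelope restatement of that arithmetic, as a quick Python sketch (all figures are the assumed ones from the example above, not measurements):

    # Rough load figures for the hypothetical 1K-member, 500-post/day list.
    members = 1000            # list membership
    posts_per_day = 500       # inbound posts per day
    MAX_RCPT_TOS = 500        # recipients batched into one SMTP transaction

    auth_cycles = posts_per_day                        # one authentication per post
    membership_gens = posts_per_day                    # one membership expansion per post
    outbound_msgs = posts_per_day * members            # 500,000 individual deliveries
    smtp_transactions = outbound_msgs // MAX_RCPT_TOS  # 1,000 SMTP transactions

    print(auth_cycles, membership_gens, outbound_msgs, smtp_transactions)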
Observation: In the realm of MLM installations there are two end points to the scalability problem:
- Sites with lists with very large numbers of members
- Sites with large numbers of lists which have few members.
Sites with large numbers of lists with large numbers of members (and presumably large numbers of messages per list) are the pessimal case, and that is not one Mailman is currently targeting to solve.
The first case is outbound IO bound. The second case may be local storage IO bound, as it spends significant time walking local filesystems during queue processing while the outbound IO rates are comparatively small (and unbursty). Possibly.
SourceForge falls into the second case.
Observation: Traffic bursts are bad. Minimally the MLM should attempt to smooth out delivery rates to a given MTA to be no higher than N messages/time. This doesn't mean the MLM doesn't deliver mail quickly, just that in the case of a mail burst (suddenly 20 million messages sitting in the outbound queue) the MLM will give the MTA the opportunity to try to react intelligently rather than overwhelming it near instantly with all 20M messages dumped in the MTA spool over 30 seconds while the spool filesystem gags.
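As a rough sketch of the kind of smoothing meant here (the per-minute ceiling and the delivery hand-off are placeholders, not real Mailman settings):

    import time

    MAX_PER_MINUTE = 1000   # the "N messages/time" knob; the value is illustrative

    def drain(entries, deliver):
        """Hand queue entries to the MTA no faster than MAX_PER_MINUTE.

        `entries` is any iterable of ready-to-send queue entries and
        `deliver` is whatever performs the actual MTA hand-off; both are
        placeholders for this sketch.
        """
        window_start = time.time()
        sent = 0
        for entry in entries:
            if sent >= MAX_PER_MINUTE:
                # Sit out the rest of the minute rather than burying the MTA.
                time.sleep(max(0.0, 60.0 - (time.time() - window_start)))
                window_start = time.time()
                sent = 0
            deliver(entry)
            sent += 1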
There are five basic transition points for a message passing thru a mailing list server:
- Receipt of message by local MTA
- Receipt by list server
- Approval/editing/moderation
- Processing of message and emission of any resultant message(s)
- Delivery of message to MTA for final delivery.
#1 is significant only because we can rely on the MTA to distinguish between valid list-related addresses and non-list addresses.
#2 is just that. The message is received by the MLM and put somewhere where it will later be processed. The intent is that this is a lightweight LDA process that does nothing but write queue files. The MLM's business is to make life as easy as possible on the MTA. This is part of that.
#3 Mainly occurs for moderation, and includes editing, approval, authentication, and any other requisite steps. The general purpose of this step is to determine what (if any) subsequent processing there will be of this message.
#4 Any requisite processing on the message occurs, and any messages generated by that processing are placed in the outbound queue.
#5 An equivalent to the current queue runner process empties the queue by creating SMTP transactions for the entries in the queue.
The basic view I'm taking of the list server is that it is a staged sequence of processes, each invoked distinctly, orchestrated in the background by cron.
Note: Bounce processing and request processing are not detailed at this point as their rate of occurrence outside of DoS attacks is comparatively low and they are far cheaper than list broadcasts in general.
List processing is a sequence of accepting a message, performing various operations on it which cause state changes to the message and the list processing system, and optionally emitting some number of messages at the end.
As such this lends itself to process queues and process pipes.
We don't want an over-arching API, or the attempt to solve the entire problem with either one hammer, or one sort of hammer. The intent is to build something that the end user/SysAdm can adapt to his local installation without either stretching or breaking the model, and without needing to build an installation which is necessarily structurally very different from either the very light weight single machine small list system, or the larger EGroups/Topica equivalent.
By using process queues based on canonical names in known filesystem locations and pre-defined data exchange formats between processes we can make the processes themselves arbitrary black boxes so long as they accept the appropriate inputs and generate the expected output.
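A minimal sketch of that black-box contract, assuming (purely for illustration) filesystem queue directories and a stage that either promotes an entry to the next queue or leaves it alone:

    import os
    import shutil

    # Placeholder canonical queue locations; the real layout is whatever the
    # installation defines -- the point is only that the names are agreed on.
    INBOUND = '/var/mailman/queues/inbound'
    PENDING = '/var/mailman/queues/pending'

    def run_stage(process):
        """Treat one processing stage as a black box: any `process` callable
        that takes the path of a queued entry and returns true (accept) or
        false (leave it for a later pass) satisfies the contract."""
        for name in sorted(os.listdir(INBOUND)):
            entry = os.path.join(INBOUND, name)
            if process(entry):
                shutil.move(entry, os.path.join(PENDING, name))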
--<cut>--
-- J C Lawrence claw@kanga.nu ---------(*) : http://www.kanga.nu/~claw/ --=| A man is as sane as he is dangerous to his environment |=--
At 7:51 PM -0800 12/11/00, J C Lawrence wrote:
ObTheme: All config files should be human readable unless those files are dynamically created and contain data which will be easily and automatically recreated.
ObTheme: All configuration should be possible via the web, even if the system is misconfigured and non-functional. Anything that can NOT be safely reconfigured without breaking the system should not be configurable via the web. (in other words, anything you can change, you should be able to change remotely, unless you can break the system. If you can break the system, you shouldn't be allowed near it trivially...)
Using multiple simultaneous processes/threads to parallelise a given task.
Using multiple systems running parallel to parallelise a given task.
Using multiple systems, each one dedicated to some portion(s) or sub-set of the overall task (might be all working in parallel on the entire problem (lock contention! failure modes!)).
that's my model perfectly, although I think 2 and 3 are reversed. it's cleaner architecturally to go to divesting and distributing functionality before 'clustering'. In fact, I'm not sure clustering (which I'll use as the term for multiple mailman systems running in parallel) is implied by even a really, really large system, when you realize that the primary resource eaters (like delivery) can effectively be infinitely distributed. I'm not sure how big a Mailman system you'd need to require parallelizing the core process, as long as you can divest off other pieces to a farm that could grow without bounds. So maybe we don't need that next (complicated) step, and can make it parallelized and distributable for everything except that core control process, but manage the complexity of that control process to keep everything out of it except the absolute necessities.
Observation: MLMs are primarily IO bound devices, and are specifically IO bound on output. Internal processing on mail servers, even given crypto authentication and expensive membership generation processes (eg heavy SQL DB joins etc), is an order of magnitude smaller problem than just getting the outbound mail off the system.
some of that is the MTA's problem, actually, but they get tied together. you don't, for instance, want an MLM that will dump 50K pieces of email an hour into the queues of an MTA that can only process 40K...
But in general, you're correct. Especially if you define DNS delays and SMTP protocol delays caused by the receiving machine to be "output" (grin)
Sites with large numbers of lists with large numbers of members (and presumably large numbers of messages per list) are the pessimal case, and that is not one Mailman is currently targeting to solve.
but if you define the distribution capabilities correctly, this case is solved by throwing even more hardware at it, and the owners of this pessimal case presumably have a budget for it. If you see someone trying to run SourceForge on a 486 and a 128K DSL line, you laugh at them.
Observation: Traffic bursts are bad. Minimally the MLM should attempt to smooth out delivery rates to a given MTA to be no higher than N messages/time.
The obverse of that is that end-users seriously dislike delays, especially on conversational lists. It turns into the old "user expectation" problem -- it's better to hold ALL mail for 15 minutes so users come to expect it than to normally deliver mail in 2 minutes, except during the worst bulges... But in general, the MLM should deliver as fast as it reasonably can without overloading the MTA, which implies some kind of monitoring setup for the MTA, or some user-controlled throttling system. the latter, unfortunately, implies teaching admins how to monitor and adjust, a support issue. The former implies writing an interface for every MTA -- a development AND support issue.
20Million messages sitting in the outbound queue), that the MLM will give the MTA the opportunity to try and react intelligently rather than overwhelming it near instantly with all 20M messages dumped in the MTA spool over 30 seconds while the spool filesystem gags.
I will not make comments about qmail. I will not make comments about qmail. I will be good. I will be good. (grin)
- Receipt of message by local MTA
1a) passthrough of message via a security wrapper from MTA to list server... (I think it's important we remember that, because we can't lose it, and it involves a layer of passthrough and a process spawning, so it's somewhat heavyweight -- but indispensable)
- Receipt by list server
- Approval/editing/moderation
- Processing of message and emission of any resultant message(s)
- Delivery of message to MTA for final delivery.
6) delivery of message to non-MTA recipients (the archiver, the
logging thing, the digester, the bounce processor....)
#1 is significant only because we can rely on the MTA to distinguish between valid list-related addresses and non-list addresses.
although one thing I've toyed with is to give a subdomain to the MLM, and simply pass everything to it (in sendmail terms, using virtusertable to pass @list.foo.bar to mailman@foo.bar). Then you take the MLM out of having to know what lists exist and the administrative need to keep that interface in sync. The downside is it doesn't fit the design of some users (but that can be fixed by education if we can prove why it's better), and you get into having to handle some MTA functions, such as DSN compatible bounce messages. I've more or less decided that when I rewrite my internal corporate mail list, I'll do that rather than generate alias listings (for, oh, 12,000 groups) and the hassles and overheads of all that. That'll be especially useful if we do what I hope, which is set it up so the server has no data at all, but authenticates via LDAP to get list information on demand out of the corporate databases. There are some definite advantages to not knowing whether something exists until the need to know exists -- and as Mailman starts edging towards interfacing to non-Mailman data sources for list information, that ability grows in importance.
(6) is the processing needed to support other functions that act on messages. The idea is that instead of delivering to the MTA, we have a suite of functions that deliver the message to whatever needs to process it. Those can be asynchronous and don't need to be as timely as (5), and have different enough design needs that I split them out from the MTA delivery (although traditionally, stuff like digests are managed by doing an MTA transfer out of the MLM and back in to a different program...)
It also assumes that these non-delivery things are separate processes from the act of making them available to those things, to keep (6) as lightweight as possible.
Note: Bounce processing and request processing are not detailed at this point as their rate of occurrence outside of DoS attacks is comparatively low and they are far cheaper than list broadcasts in general.
and besides, they are basically independent, asynchronous processes that don't need to be managed by any of the core logic, other than handing messages into their queue and making sure they stay running. same with, IMHO, storing messages for archives, storing messages for digests, updating archives, processing digests (but the processed digest is fed back into the core logic for delivery), and whatever else we decide it needs to do that isn't part of the core, time-sensitive code base. (in fact, there's no reason why you couldn't have multiple flavors of these things, feeding archives into an mbox, another archiver into mhonarc or pipermail, something that updates the search engine indexes, and text and mime digesters... by turning them into their own logic streams with their own queues, you effectively have just made them all plug-in swappable, because you're writing to a queue, and not worrying about what happens once it's there. you merely need to make sure it goes in the right queue, in the approved format.)
We don't want an over-arching API, or the attempt to solve the entire problem with either one hammer, or one sort of hammer.
I like hammers! My thumb doesn't, not since the divorce, at least...
kewl. good stuff here.
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
We're visiting the relatives. Cover us.
CVR> ObTheme: All configuration should be possible via the web,
CVR> even if the system is misconfigured and
CVR> non-functional. Anything that can NOT be safely reconfigured
CVR> without breaking the system should not be configurable via
CVR> the web. (in other words, anything you can change, you should
CVR> be able to change remotely, unless you can break the
CVR> system. If you can break the system, you shouldn't be allowed
CVR> near it trivially...)
Agreed completely. web_page_url comes to mind here. :(
-Barry
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> The obverse of that is that end-users seriously dislike
CVR> delays, especially on conversational lists. It turns into the
CVR> old "user expectation" problem -- it's better to hold ALL
CVR> mail for 15 minutes so users come to expect it than to
CVR> normally deliver mail in 2 minutes, except during the worst
CVR> bulges...
That's been my experience as well. People expect email to do strange things to conversations, like show you replies before you've seen the original message, or have turnaround times on the order of quarter-hours or whatever. It's when the behavior they expect changes that people start to notice.
Case in point: I recently moved most of the python.org mailing lists to a new, faster machine with better network connectivity. Turnaround time tanked and I started getting complaints that messages weren't being seen in inboxes after 6 hours or so. To make matters worse, those messages were /in/ the archive! <scratch> <scratch>. Ah ha! /etc/syslog.conf was configured to log mail.* and syslogd was starving the MTA.
CVR> But in general, the MLM should deliver as fast as
CVR> it reasonably can without overloading the MTA, which implies
CVR> some kind of monitoring setup for the MTA, or some
CVR> user-controlled throttling system. the latter, unfortunately,
CVR> implies teaching admins how to monitor and adjust, a support
CVR> issue. The former implies writing an interface for every MTA
CVR> -- a development AND support issue.
Let me change that a little bit. The MLM should /process/ messages as fast as possible, getting them through the moderate-and-munge and into the outbound queue at top speed possible. Once that message is sitting in that outbound queue, it's that queue's runner process that can be configured to throttle, distribute, batch, whatever it takes. It's not the MLM's problem (not to say it isn't the problem of the simple-minded qrunner script we distribute and enable by default). All I need to do is document the file format for the outbound queue files and site administrators can take it from there.
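For illustration, a minimal sketch of such a simple-minded outbound runner, assuming (for the sketch only, not Mailman 2.0's real queue layout) one RFC822 message file plus a recipients file per entry:

    import os
    import smtplib

    QDIR = '/var/mailman/queues/outbound'   # assumed location, not the real spool

    def run_once():
        """One pass over the outbound queue.  Assumed layout: <base>.msg is
        the RFC822 message text, <base>.rcpt holds one recipient per line."""
        smtpd = smtplib.SMTP('localhost')
        try:
            for name in sorted(os.listdir(QDIR)):
                if not name.endswith('.msg'):
                    continue
                base = os.path.join(QDIR, name[:-len('.msg')])
                with open(base + '.msg') as f:
                    msgtext = f.read()
                with open(base + '.rcpt') as f:
                    rcpts = [line.strip() for line in f if line.strip()]
                # The envelope sender would normally be the list's bounce address.
                smtpd.sendmail('list-bounces@example.com', rcpts, msgtext)
                os.remove(base + '.msg')
                os.remove(base + '.rcpt')
        finally:
            smtpd.quit()

A throttling, batching, or farm-distributing runner would replace only this loop; nothing upstream of the queue needs to know.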
-Barry
At 7:18 PM -0500 12/14/00, Barry A. Warsaw wrote:
Let me change that a little bit. The MLM should /process/ messages as fast as possible, getting them through the moderate-and-munge and into the outbound queue at top speed possible. Once that message is sitting in that outbound queue, it's that queue's runner process that can be configured to throttle, distribute, batch, whatever it takes.
buy that man a beer.
It's not the MLM's problem (not to say it isn't the problem of the simple-minded qrunner script we distribute and enable by default). All I need to do is document the file format for the outbound queue files and site administrators can take it from there.
except I see this as still part of the MLM, since it's the tool doing the MLM->MTA handoff, not part of the MTA itself.
chuq
Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
We're visiting the relatives. Cover us.
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
>> It's not the MLM's problem (not to say it isn't the problem of
>> the simple-minded qrunner script we distribute and enable by
>> default). All I need to do is document the file format for the
>> outbound queue files and site administrators can take it from
>> there.
CVR> except I see this as still part of the MLM, since it's the
CVR> tool doing the MLM->MTA handoff, not part of the MTA itself.
Fair enough. It's definitely not part of the MTA.
-Barry
At 9:49 PM -0500 12/14/00, Barry A. Warsaw wrote:
Fair enough. It's definitely not part of the MTA.
MTAs deliver stuff, real fast. but they're real dumb. You have to tell them exactly what to deliver, and so the MLM has to hand off finished pieces. think of the MTA as the fedex delivery truck. it doesn't address boxes or tape the packages or fill them with popcorn worms. All that is done in the warehouse of the shipper -- and that's the MLM. It goes on the truck, and the truck speeds away into the sunset with your package, and all you have is a tracking number...
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
We're visiting the relatives. Cover us.
I like the idea of process queues, but I don't want to take the federation-of-processes architecture too far. Yes, we want a component architecture, but where I see the process boundaries is at the message queue level.
For the delivery of messages, I see Mailman's primary job as moderation-and-munge. Messages come into the system from the MTA, nntp-scraper, web-board poster, or are internally crafted. All these things end up in the incoming queue. They need to be approved, rewritten, moderated, and eventually sent on to various outbound queues: nntp-poster, smtp-delivery, archiver, etc. Some of these are completely independent of the Mailman databases. E.g. it is a mistake that SMTPDirect is in the message pipeline in 2.0 because once a message hits this component, its future disposition is (largely) independent of the rest of the system.
So in my view, when Mailman decides that a message can be delivered to a membership list, it's dropped fully formed in an outbound queue. The file formats are the interface b/w Mailman and the queue runners and should be platform (i.e. Python) independent. That way, I can ship a simple queue runner that takes messages from the outbound queue and hands them off to the smtpd, but /you/ could drop in a different runner process that uses GNQS to distribute load across an infinitely expandable smtpd server farm.
[Side note. Here's another reason why I'm keen on ZODB/ZEO as the underlying persistency mechanism for internal Mailman data: I believe we can parallelize the moderate-and-munge part of message processing. Because the ZEO protocols serialize writes at commit time, you could have multiple moderate-and-munge processes running on a server farm and guarantee db consistency across them. What I don't know is how ZEO would perform given a write-intensive environment (and maybe Mailman isn't as write intensive as I think it is). But even if it sucks, it simply means that the moderate-and-munge part won't be efficiently parallelizable until that's fixed.]
"JCL" == J C Lawrence <claw@kanga.nu> writes: "CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
JCL> There are five basic transition points for a message passing
JCL> thru a mailing list server:
| 1) Receipt of message by local MTA
| 1a) passthrough of message via a security wrapper from MTA to
| list server... (I think it's important we remember that, because
| we can't lose it, and it involves a layer of passthrough and a
| process spawning, so it's somewhat heavyweight -- but
| indispensable)
No problems here, because I see these as being outside the bounds of the MLM. The MLM has an incoming queue and it expects messages in a particular format (very likely just RFC822 text files). These arrive here via whatever tortuous path is necessary: MTA->security wrapper, nntpd->news scraper, web board cgi poster, etc.
| 2) Receipt by list server
| 3) Approval/editing/moderation
What I've been calling moderate-and-munge.
| 4) Processing of message and emission of any resultant message(s)
Here's where the output queues and process boundaries come in. Once they're in the outbound queues, Mailman's out of the loop.
| 5) Delivery of message to MTA for final delivery.
Again, that's the responsibility of the mta-qrunner, be it a simple-minded Python process like today's qrunner, or a batch processing system like you've been investigating.
These processes are not completely independent of Mailman though, e.g. for handling hard errors at smtp transaction time or URL generation for summary digests. Some of these can be handled by re-injection into the message queues (i.e. generate a bounce message and stick it in the bounce queue), but some may need an rpc interface.
| 6) delivery of message to non-MTA recipients (the archiver, the
| logging thing, the digester, the bounce processor....)
Each of these should be separate queues with defined process interfaces, but again there may be synchronous information communicated back to Mailman. The archiver discussions we've had come to mind here.
CVR> and besides, they are basically independent, asynchronous
CVR> processes that don't need to be managed by any of the core
CVR> logic, other than handing messages into their queue and
CVR> making sure they stay running. same with, IMHO, storing
CVR> messages for archives, storing messages for digests, updating
CVR> archives, processing digests (but the processed digest is fed
CVR> back into the core logic for delivery), and whatever else we
CVR> decide it needs to do that isn't part of the core,
CVR> time-sensitive code base. (in fact, there's no reason why you
CVR> couldn't have multiple flavors of these things, feeding
CVR> archives into an mbox, another archiver into mhonarc or
CVR> pipermail, something that updates the search engine indexes,
CVR> and text and mime digesters... by turning them into their own
CVR> logic streams with their own queues, you effectively have
CVR> just made them all plug-in swappable, because you're writing
CVR> to a queue, and not worrying about what happens once it's
CVR> there. you merely need to make sure it goes in the right
CVR> queue, in the approved format.)
I agree!
-Barry
On Thu, 14 Dec 2000 19:05:25 -0500 Barry A Warsaw <barry@digicool.com> wrote:
I like the idea of process queues, but I don't want to take the federation-of-processes architecture too far. Yes, we want a component architecture, but where I see the process boundaries is at the message queue level.
There are in essence seven queues:
1) inbound: Message arrives at the MLM
2) authentication: Do I accept it?
3) moderation: Does the list accept it?
4) pending: Associate a distribution list with message
5) outbound: Send it.
6) bounce: Demote the subscriber
7) command: A combo of #2, #3, and command processing.
There's a possible eighth for OOB stuff like archiving and digests, which I mostly see as a fork off the side of the pending queue.
So in my view, when Mailman decides that a message can be delivered to a membership list, it's dropped fully formed in an outbound queue.
Not exactly. It drops a message, any relevant meta data, and a distribution list in the outbound queue. A delivery process then takes that and does what it will with them (eg VERP, application of templates, etc).
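A sketch of what one such outbound entry could look like on disk; the three-sibling-file convention here is an assumption for illustration, not a decided format:

    import os

    OUTBOUND = '/var/mailman/queues/outbound'   # assumed queue location

    def enqueue(entry_id, msgtext, metadata, recipients):
        """Drop one outbound entry as three sibling files:
          <id>.msg   the RFC822 message text
          <id>.meta  'key: value' meta data lines, human readable
          <id>.dist  one recipient address per line (the distribution list)
        """
        base = os.path.join(OUTBOUND, entry_id)
        with open(base + '.msg', 'w') as f:
            f.write(msgtext)
        with open(base + '.meta', 'w') as f:
            for key, value in sorted(metadata.items()):
                f.write('%s: %s\n' % (key, value))
        with open(base + '.dist', 'w') as f:
            f.write('\n'.join(recipients) + '\n')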
Process pipes...
The file formats are the interface b/w Mailman and the queue runners and should be platform (i.e. Python) independent.
Bingo. This is a point I've invested considerable time into.
That way, I can ship a simple queue runner that takes messages from the outbound queue and hands them off to the smtpd, but /you/ could drop in a different runner process that uses GNQS to distribute load across an infinitely expandable smtpd server farm.
If you continue the same abstraction across all queues and the staging processes of queues, you build something that isn't inherently a queue-run system; it merely looks like one and can in fact be fairly trivially hung off a queue based system (MQM or whatever).
Consider the following setup:
Three machines:
HostA is the primary MX and receives the list mail along with
all mail for the rest of the site.
HostB has a private hole in the firewall and is the only host to
have access to the backing stores for authentication and
membership data.
HostC has a nicely tuned MTA built for outbound processing.
Given a queue based system, supporting that, or something several dozen times more complex, becomes trivial. The problem is in making the architecture that runs on a single host without an external queue manager the same as the system above, where different hosts each take responsibility for different queues in the message system.
It can be done, it just requires a little elegance.
[Side note. Here's another reason why I'm keen on ZODB/ZEO as the underlying persistency mechanism for internal Mailman data: I believe we can parallelize the moderate-and-munge part of message processing. Because the ZEO protocols serialize writes at commit time, you could have multiple moderate-and-munge processes running on a server farm and guarantee db consistency across them.
There are problems with this due to the fact that external transactions (such as SMTP sends) are asynchronous and not nested in ZODB transactions.
"JCL" == J C Lawrence <claw@kanga.nu> writes: "CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
These processes are not completely independent of Mailman though, e.g. for handling hard errors at smtp transaction time or URL generation for summary digests. Some of these can be handled by re-injection into the message queues (i.e. generate a bounce message and stick it in the bounce queue), but some may need an rpc interface.
Thus the pending queue above -- it allows a message to undergo a set of pre-post filters prior to landing in the outbound queue. Archiving, digests, all sorts of things can happen at that point.
-- J C Lawrence claw@kanga.nu ---------(*) http://www.kanga.nu/~claw/ --=| A man is as sane as he is dangerous to his environment |=--
"JCL" == J C Lawrence <claw@kanga.nu> writes:
| 1) inbound: Message arrives at the MLM
| 2) authentication: Do I accept it?
| 3) moderation: Does the list accept it?
Remind me again about the difference between 2 and 3, and why 2 is under the purview of the MLM.
| 4) pending: Associate a distribution list with message
| 5) outbound: Send it.
| 6) bounce: Demote the subscriber
| 7) command: A combo of #2, #3, and command processing.
I'm not totally sold that 4 needs a process boundary, or that 2, 3, and 4 aren't part of the same process, structured as interfaces in the framework.
So in my view, when Mailman decides that a message can be delivered to a membership list, it's dropped fully formed in an outbound queue.
JCL> Not exactly. It drops a message, any relevant meta data, and
JCL> a distribution list in the outbound queue. A delivery
JCL> process then takes that and does what it will with them (eg
JCL> VERP, application of templates, etc).
In Mailman 2.0 the distribution list is also metadata -- it lives in the .db file. Do you see that differently?
[Side note. Here's another reason why I'm keen on ZODB/ZEO as the underlying persistency mechanism for internal Mailman data: I believe we can parallelize the moderate-and-munge part of message processing. Because the ZEO protocols serialize writes at commit time, you could have multiple moderate-and-munge processes running on a server farm and guarantee db consistency across them.
JCL> There are problems with this due to the fact that external
JCL> transactions (such as SMTP sends) are asynchronous and not
JCL> nested in ZODB transactions.
But I see the smtp sends as being in a separate process, i.e. the outbound qrunner. The easiest way I see of getting smtp hard failures back into Mailman is simply to craft an internal message and drop it in the bounce queue.
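A minimal sketch of that re-injection; the bounce queue location and the notice format are invented for the sketch:

    import os
    import time

    BOUNCE_QDIR = '/var/mailman/queues/bounce'   # assumed location

    def report_hard_failure(listname, recipient, smtp_code, smtp_msg):
        """Craft a tiny internal notice and drop it in the bounce queue so the
        separate, asynchronous bounce runner can decide whether to demote the
        subscriber."""
        body = ('List: %s\nRecipient: %s\nStatus: %s %s\n'
                % (listname, recipient, smtp_code, smtp_msg))
        fname = os.path.join(BOUNCE_QDIR,
                             '%d.%s.bounce' % (int(time.time()), recipient))
        with open(fname, 'w') as f:
            f.write(body)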
These processes are not completely independent of Mailman though, e.g. for handling hard errors at smtp transaction time or URL generation for summary digests. Some of these can be handled by re-injection into the message queues (i.e. generate a bounce message and stick it in the bounce queue), but some may need an rpc interface.
JCL> Thus the pending queue above -- it allows a message to
JCL> undergo a set of pre-post filters prior to landing in the
JCL> outbound queue. Archiving, digests, all sorts of things can
JCL> happen at that point.
Maybe I'm taking the term "distribution list" too narrowly in your description above. Do you mean "list of recipient addresses" or do you mean "internal queue routing destinations", or something else?
-Barry
On Thu, 14 Dec 2000 22:37:51 -0500 Barry A Warsaw <barry@digicool.com> wrote:
"JCL" == J C Lawrence <claw@kanga.nu> writes:
| 1) inbound: Message arrives at the MLM
| 2) authentication: Do I accept it?
| 3) moderation: Does the list accept it?
Remind me again about the difference between 2 and 3, and why 2 is under the purview of the MLM.
| 4) pending: Associate a distribution list with message
| 5) outbound: Send it.
| 6) bounce: Demote the subscriber
| 7) command: A combo of #2, #3, and command processing.
I'm not totally sold that 4 needs a process boundary, or that 2, 3, and 4 aren't part of the same process, structured as interfaces in the framework.
Oddly enough I didn't write what I meant to write, and what I meant also wasn't what my own notes and documents say (which happen to be more correct than I was). I'm looking at the following process. First the base model:
Configuration of what exactly happens to a message is done by dropping scripts/programs in specially named directories (it is expected that typically only SymLinks will be dropped (which makes the web interface easy -- just creates and moves symlinks about)).
In general processing of any item will consist of taking that item and iteratively passing it to every script in the appropriately named directory, invoking those scripts in directory sort order (cf SysV init scripts with /etc/rc.# etc). This makes the web config interface easy -- it just varies the numerical prefix on the scripts to enforce a processing order.
Of course each script must follow a defined contract as far as arguments, IO, return codes, and other behaviour.
The processing sequence for a message:
Message arrives in inbound queue.
Message is picked up and passed on to moderation, which consists of:
a) extracting a set of meta data from the message and any
associated resources and then associating that meta data with
the message (this is done as an efficiency support ala
pre-processing (very useful for later template expansion))
b) Iteratively passing the message thru all the scripts in the
moderation directory, in order, until either one returns
non-zero or all scripts have been run. All scripts returning
zero means that the message is moved to the pending queue.
Various non-zero returns have other effects, ranging from instant
deletion, to leaving the message in the inbound queue, to moving it to
the moderation queue ("holding pen" is more accurate), to
auto-bouncing some sort of reply...
Some event happens to move a message from the moderation queue to the pending queue (or it goes straight there, bypassing moderation).
Message is found in the pending queue and two things happen in order:
a) It is passed iteratively, with its meta data through every
script in the membership directory. A non-zero return means
that the message stays in pending. The combined output (stdout)
of all membership scripts forms the distribution list for that
message. Upon getting a membership (everything returns zero),
the resultant list is passed thru a specially named script (if
it exists) for post-processing (dupe removal, VERP instruction
insertion, domain/MX sorting, etc). The final distribution list
is associated with the message much like the meta data already
is.
b) The message is passed iteratively, with its meta data and
distribution list through the contents of the pre-post
directory. This does whatever it wants to do (archiving,
digests, templating, VERP spit outs, whatever). A non-zero
return will leave the message in the pending queue for a later
pass.
Finally, having gained a distribution list and been pre-post processed (if there's anything there), the message is moved to the outbound queue with its distribution list.
The message is found in the outbound queue and is handed off to whatever the transport is (MTA, NNTP, whatever).
Now where this gets interesting is that every script and tool along the path above is free to edit the message, edit the meta data, edit the distribution list, to cause the current message to be silently deleted, and/or to create new or derived messages and to inject them into any other message queue in the system.
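A minimal sketch of the directory-walking machinery described above; the sort order and return-code behaviour come from the description, while the invocation convention (queue entry path as the single argument) is an assumption:

    import os
    import subprocess

    def run_directory(dirpath, entry_path):
        """Run every executable in `dirpath` in directory sort order (SysV
        init style), passing the path of the queued entry as the only
        argument.  Returns 0 if every script accepted the message, otherwise
        the first non-zero return code, which the caller maps onto
        delete / hold / leave-in-queue / auto-bounce as described above."""
        for name in sorted(os.listdir(dirpath)):
            script = os.path.join(dirpath, name)
            if not os.access(script, os.X_OK):
                continue              # skip stray non-executables
            rc = subprocess.call([script, entry_path])
            if rc != 0:
                return rc
        return 0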
So in my view, when Mailman decides that a message can be delivered to a membership list, it's dropped fully formed in an outbound queue.
Yes, with its associated distribution list.
JCL> Not exactly. It drops a message, any relevant meta data, and a
JCL> distribution list in the outbound queue. A delivery process
JCL> then takes that and does what it will with them (eg VERP,
JCL> application of templates, etc).
In Mailman 2.0 the distribution list is also metadata -- it lives in the .db file. Do you see that differently?
I see it as a class of meta data, but specifically one that is bound to and unique to the message in question, not to the list or the list config (consider the previous discussion of nosy lists). As such, a distribution list is genned for every message, and then attached to that message for later editing/delivery.
These processes are not completely independent of Mailman though, e.g. for handling hard errors at smtp transaction time or URL generation for summary digests. Some of these can be handled by re-injection into the message queues (i.e. generate a bounce message and stick it in the bounce queue), but some may need an rpc interface.
JCL> Thus the pending queue above -- it allows a message to undergo
JCL> a set of pre-post filters prior to landing in the outbound
JCL> queue. Archiving, digests, all sorts of things can happen at
JCL> that point.
Maybe I'm taking the term "distribution list" too narrowly in your description above. Do you mean "list of recipient addresses" or do you mean "internal queue routing destinations", or something else?
I don't understand your question. Given the above, could you rephrase?
-- J C Lawrence claw@kanga.nu ---------(*) http://www.kanga.nu/~claw/ --=| A man is as sane as he is dangerous to his environment |=--
"JCL" == J C Lawrence <claw@kanga.nu> writes:
JCL> Configuration of what exactly happens to a message is done
JCL> by dropping scripts/programs in specially named directories (it
JCL> is expected that typically only SymLinks will be dropped
JCL> (which makes the web interface easy -- just creates and moves
JCL> symlinks about)).
At a high level, what you're describing is a generalization of MM2's message handler pipeline. In that respect, I'm in total agreement. It's a nice touch to have separate pipelines between each queue boundary, with return codes directing the machinery as to the future disposition of the message.
But I don't like the choice of separate scripts/programs as the basic components of this pipeline. Let me rotate that just a few degrees to the left, squint my eyes, and change the scripts to Python modules, and return codes to return values or exceptions. Then I'm sold, and I think you can do everything you want (including using separate scripts if you want), and are more efficient for the common situations.
First, we don't need to mess with symlinks to make processing order configurable. We simply change the order of entries in a sequence (read: Python list). It's a trivial matter to allow list admins to select the names of the components they want, the order, etc. and to keep this information on a per-list basis. Actually, the web interface I imagine doesn't give list admins configurability at that fine a grain. Instead, a site administrator can set up list "styles" or patterns, one of which includes canned filter sets; i.e. predefined component orderings created, managed, named, and made available by the site administrator.
Second, it's more efficient because I imagine Mailman 3.0 will be largely a long running server process, so modules need only be imported once as the system warms up. Even re-importing in a one-shot architecture will be more efficient than starting and stopping scripts all the time, because of the way Python modules cache their bytecodes (pyc files).
Third, you can still do separate scripts/programs if you want or need. Say there's something you can only do by writing a separate Java program to interface with your corporate backend Subject: header munger. You should be able to easily write a pipeline module that hides all that in the implementation. You can even design your own efficient backend IPC protocol to talk to whatever external resource you need to talk to. I contend that the overhead and complexity of forking off scripts, waiting for their exit codes, process management, etc. etc. just isn't necessary in the common case, where 5 or 50 lines of Python will do the job nicely.
Fourth, yes, maybe it's a little harder to write these components in Perl, bash, Icon or whatever. That doesn't bother me. I'm not going to make it impossible, and in fact, I think that if that were to become widely necessary, a generic process-forking module could be written and distributed.
I don't think this is very far afield of what you're describing, and it has performance and architectural benefits IMO. We still formalize the interface that pipeline modules must conform to, probably spelled like a Python class definition, with elaborations accomplished through subclassing.
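One way that interface might be spelled; the class and exception names here are purely illustrative, not actual Mailman names:

    class HoldMessage(Exception):
        """Raised by a component to divert the message to the holding pen."""

    class DiscardMessage(Exception):
        """Raised by a component to drop the message silently."""

    class Handler:
        """Base class for one pipeline component; subclasses override
        process().  Falling off the end means 'accept and continue'."""
        def process(self, mlist, msg, msgdata):
            raise NotImplementedError

    def run_pipeline(handlers, mlist, msg, msgdata):
        """The in-process equivalent of walking a script directory: the
        per-list configuration supplies `handlers` as an ordered list."""
        for handler in handlers:
            handler.process(mlist, msg, msgdata)   # exceptions divert the message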
Does this work for you? Is there something a script/program component model gives you that the class/module approach does not?
-Barry
On Fri, 15 Dec 2000 01:17:58 -0500 Barry A Warsaw <barry@digicool.com> wrote:
"JCL" == J C Lawrence <claw@kanga.nu> writes:
JCL> Configuration of what exactly happens to a message is done by
JCL> dropping scripts/programs in specially named directories (it is
JCL> expected that typically only SymLinks will be dropped (which
JCL> makes the web interface easy -- just creates and moves symlinks
JCL> about)).
At a high level, what you're describing is a generalization of MM2's message handler pipeline. In that respect, I'm in total agreement. It's a nice touch to have separate pipelines between each queue boundary, with return codes directing the machinery as to the future disposition of the message.
<nod>
But I don't like the choice of separate scripts/programs as the basic components of this pipeline. Let me rotate that just a few degrees to the left, squint my eyes, and change the scripts to Python modules, and return codes to return values or exceptions. Then I'm sold, and I think you can do everything you want (including using separate scripts if you want), and are more efficient for the common situations.
Fair dinkum, given the below caveat.
First, we don't need to mess with symlinks to make processing order configurable. We simply change the order of entries in a sequence (read: Python list). It's a trivial matter to allow list admins to select the names of the components they want, the order, etc. and to keep this information on a per-list basis.
<nod>
Actually, the web interface I imagine doesn't give list admins configurability at that fine a grain. Instead, a site administrator can set up list "styles" or patterns, one of which includes canned filter sets; i.e. predefined component orderings created, managed, named, and made available by the site administrator.
I'll discuss this later below (it comes down to a multi-level list setup/definition deal).
Second, it's more efficient because I imagine Mailman 3.0 will be largely a long running server process, so modules need only be imported once as the system warms up.
I have been working specifically on the assumption that it will not be a long running process, and that instead it will be automated by cron starting up a helper app periodically which will fork an appropriate number of sub-processes to run the various queues (with simple checks to make sure that the total number of queue running processes of a given type on a given host doesn't exceed some configured value). The base reason for this assumption is that it makes the queue processing more analogous to traditional queue managers, allowing the potential transition from Mailman's internal (cron based) automation to a real queue manager to be semi-transparent. The assumption in this was that the tool used to move a message between queues was an external, explicitly stand-alone script. The supporting reason being that simple replacement of that script by something that called the appropriate queue management tools for the queue manager du jour would allow the removal of the Mailman "listmom" and its replacement by the queue manager, be it LSF, QPS MQM, GNU Queue, or something else.
This is what I mean by "light weight self-discovering processes that behave in a queue-like manner". The processes are small and light. They figure out what needs to be done locally per their host-specific configurations, and then do that in a queue-like manner.
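A sketch of that cron-driven helper; the per-host caps, the process check, and the runner path are all placeholders:

    import subprocess

    # Assumed per-host ceiling on simultaneous runners for each queue.
    MAX_RUNNERS = {'inbound': 2, 'pending': 2, 'outbound': 4, 'bounce': 1}

    def count_running(queue):
        """Placeholder: a real check would consult pid files or the process
        table for runners already working this queue on this host."""
        return 0

    def cron_tick():
        """Invoked from cron: top up each queue's runner pool to its cap,
        then exit.  The runners themselves are short-lived, stand-alone
        scripts, so swapping this helper for a real queue manager's
        scheduler changes nothing upstream."""
        for queue, cap in MAX_RUNNERS.items():
            for _ in range(cap - count_running(queue)):
                subprocess.Popen(['/usr/local/mailman/bin/queue-runner', queue])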
What's this host-specific stuff? More later.
ObNote: There actually need to be separate and discrete tools both for moving a given message into a specific queue (ie different tools for inbound, pending, outbound, etc) and for injecting messages (that didn't exist before) into each queue. Doing it this way allows a site to roll part of the system to a queue manager and allow the rest to remain default. This could be done by a single tool linked to different names, or by passing the queue name as an argument and allowing an easy call-out as above to a module-wrapped external tool.
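A sketch of the argument-taking variant of such an injection tool; the spool root and queue names are assumptions:

    #!/usr/bin/env python
    """inject: read a message on stdin and file it into the named queue.

    Usage: inject <queue-name>

    A site that hands one queue over to an external queue manager replaces
    only this tool (and its sibling that moves existing entries between
    queues); everything else stays default."""

    import os
    import sys
    import time

    QUEUE_ROOT = '/var/mailman/queues'          # assumed spool root
    VALID = ('inbound', 'pending', 'outbound', 'bounce', 'command')

    def main():
        if len(sys.argv) != 2 or sys.argv[1] not in VALID:
            sys.exit('usage: inject <%s>' % '|'.join(VALID))
        queue = sys.argv[1]
        name = '%d.%d.msg' % (int(time.time()), os.getpid())
        with open(os.path.join(QUEUE_ROOT, queue, name), 'w') as f:
            f.write(sys.stdin.read())

    if __name__ == '__main__':
        main()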
Even re-importing in a one-shot architecture will be more efficient than starting and stopping scripts all the time, because of the way Python modules cache their bytecodes (pyc files).
I'm sold given the comment on the next paragraph.
Third, you can still do separate scripts/programs if you want or need. Say there's something you can only do by writing a separate Java program to interface with your corporate backend Subject: header munger. You should be able to easily write a pipeline module that hides all that in the implementation. You can even design your own efficient backend IPC protocol to talk to whatever external resource you need to talk to. I contend that the overhead and complexity of forking off scripts, waiting for their exit codes, process management, etc. etc. just isn't necessary in the common case, where 5 or 50 lines of Python will do the job nicely.
Then we should provide a template python module that accepts the appropriate arguments, passes them to a template external program, and grabs its stdout and RC. Configuring users could/would then merely take this, rename it, customise it, and roll it in transparently.
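A sketch of that template wrapper, reduced to a single function (the external program path is a placeholder; the exit-status mapping would follow whatever module contract gets formalized):

    import subprocess

    EXTERNAL = '/usr/local/bin/subject-munger'   # placeholder external program

    def external_filter(msgtext, listname):
        """Feed the message to an external program on stdin and return its
        stdout (the possibly rewritten message) together with its exit
        status, which the calling module maps onto accept/hold/discard just
        as a native Python component would."""
        proc = subprocess.Popen([EXTERNAL, listname],
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate(msgtext.encode())
        return out.decode(), proc.returncode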
Fourth, yes, maybe it's a little harder to write these components in Perl, bash, Icon or whatever. That doesn't bother me. I'm not going to make it impossible, and in fact, I think that if that were to become widely necessary, a generic process-forking module could be written and distributed.
Umm, yeah. Shame nobody thought of that.
I don't think this is very far afield of what you're describing, and it has performance and architectural benefits IMO. We still formalize the interface that pipeline modules must conform to, probably spelled like a Python class definition, with elaborations accomplished through subclassing.
Bingo.
Does this work for you? Is there something a script/program component model gives you that the class/module approach does not?
Not inherently, given a method for easy call-outs as mentioned above.
Now, onto the business of the host-specific configurations: what I've been looking at is something like the below. The global list configuration consists of the following directories and files:
~/cgi-bin/*                  (MLM CGIs)
~/config                     (global MLM config)
~/config.force               (global MLM config (can't change))
~/config.<hostname>          (config specifics for this host)
~/scripts/*                  (all the tools and scripts that do things)
~/scripts/member/*           (membership scripts)
~/scripts/moderate/*         (moderation scripts)
~/scripts/pre-post/*         (scripts run before posting)
~/inbound/*                  (messages awaiting processing by the MLM)
~/outbound/*                 (messages to be sent by the MLM)
~/services/*                 (the processes that actually run mailman)
~/templates/*                (well, templates)
~/groups/                    (groups of list configs)
~/groups/default/            (There has to be a default)
~/groups/default/...         (Basically a full duplicate of the root setup,
                              mostly done as symlinks)
~/groups/<groupname>/config  (deltas from ~/config)
...etc
Then on the list base:
~lists/<listname>/config      (list config as deltas from group config)
~lists/<listname>/group       (symlink to ~/groups/<something>)
~lists/<listname>/moderate/*  (messages held for moderation)
~lists/<listname>/pending/*   (messages waiting to be processed)
~lists/<listname>/scripts/*   (what does all the work)
The assumption so far is that the queues are represented as discrete files on disk, much like the current held messages in v2, with file names mapping to the address/function of the message (ie list name plus command/request/post/bounce/reject/something) and filename extensions for the various meta data sets, etc (this helps keep things human readable). There are aspects of this I'm not happy with (eg distribution lists, on account of size (consider a 1M member list)).
The idea is that the config files are simple collections of variable assignments much like the current Defaults.py or mm_cfg.py. Further, they are read in the following order:
~/config
~/groups/<groupID>/config
~/lists/<listname>/config
~/groups/<groupID>/config.force
~/config.force
Where the web interface would present the options that are locked by a higher level config (ie in a force file) as present but unconfigurable.
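A sketch of that layered read, assuming the files really are simple 'name = value' lines; later files override earlier ones, so the .force files (read last) win, and the names they set are remembered so the web UI can show them as locked:

    import os

    def load_config(root, group, listname):
        """Read the config layers in the order given above and return the
        merged settings plus the set of names locked by a .force file."""
        order = [
            (os.path.join(root, 'config'), False),
            (os.path.join(root, 'groups', group, 'config'), False),
            (os.path.join(root, 'lists', listname, 'config'), False),
            (os.path.join(root, 'groups', group, 'config.force'), True),
            (os.path.join(root, 'config.force'), True),
        ]
        settings, locked = {}, set()
        for path, forced in order:
            if not os.path.exists(path):
                continue
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line or line.startswith('#') or '=' not in line:
                        continue
                    name, value = [part.strip() for part in line.split('=', 1)]
                    settings[name] = value
                    if forced:
                        locked.add(name)
        return settings, locked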
Now, the next thing: outside of populating the initial root directory with files (such as the various configured python modules etc), everything else gets done from the web. One account has access to the root and can create and edit groups etc. Another account has access to the list configs, and then of course there are moderator-only accounts. All of this of course gets exported thru the standard authentication methods so that it can get replaced by <whatever>.
-- J C Lawrence claw@kanga.nu ---------(*) http://www.kanga.nu/~claw/ --=| A man is as sane as he is dangerous to his environment |=--