GSoC 2013 - GNU Mailman - Introduction and Project Discussion
Sreyanth writes:
- Anti-spam / anti-abuse in Mailman.
A couple of people have mentioned anti-spam, and it's a frequently requested feature. Nevertheless, I don't think we should spend Google money and mentor time on it.
- Mailman is the wrong place to do filtering. It's equally effective, normally covers more messages, and is somewhat more efficient in resource usage to do it at the MTA.
- Any new algorithms *should* be made available at the MTA level where they can be best put to use by more people. This implies something that either plugs into existing filters (such as spamassassin) or MTAs (ie, milters) rather than a Handler.
- Adapting existing filters is generally pretty trivial: you write a 10-line custom Handler that pipes it to an external process. This isn't big enough for a GSoC project.
- To the extent that new algorithms are involved, I have doubts that Mailman mentors have the kind of expertise needed to really help with such a project (I could be wrong, but I certainly don't know much about that kind of text processing, and I don't know that anybody else in Mailman has expertise in it).
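For readers wondering what "pipes it to an external process" amounts to, here is a minimal sketch in plain Python. It does not import Mailman; the function name `pipe_through_filter` and the idea of replacing the message with the filter's output are assumptions made for illustration, not an existing Mailman Handler.

```python
import subprocess
from email import message_from_string
from email.message import Message


def pipe_through_filter(msg: Message, command: list[str],
                        timeout: float = 30.0) -> Message:
    """Send the message to an external filter on stdin, parse its stdout.

    `command` is whatever filter the site runs (hypothetical here); it is
    expected to print the (possibly annotated) message on stdout.
    """
    result = subprocess.run(
        command,
        input=msg.as_string(),   # full message text goes to the filter
        capture_output=True,
        text=True,
        timeout=timeout,         # never let a stuck filter block delivery
        check=True,              # raise if the filter exits non-zero
    )
    return message_from_string(result.stdout)
```

A real Handler would call something like this from its processing hook and decide what to do if the filter fails or times out; that wiring is deliberately left out.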
On the other hand, I don't know which project in GSoC would be a better place for it. It's possible to argue that Mailman is a reasonable place for it, but IMHO we probably shouldn't.
Regarding anti-abuse, we would like to do something about problems like backscatter. However, I have to wonder how much *code* (vs *specification* and *design*) is needed for those problems. If the project is really spec-heavy, it's probably not really what Google has in mind (based on comments on the mentors' list, not on any official Google pronouncements, though).
On 13-04-11 10:44 AM, Stephen J. Turnbull wrote:
- Mailman is the wrong place to do filtering. It's equally effective, normally covers more messages, and is somewhat more efficient in resource usage to do it at the MTA.
- Any new algorithms *should* be made available at the MTA level where they can be best put to use by more people. This implies something that either plugs into existing filters (such as spamassassin) or MTAs (ie, milters) rather than a Handler.
- Adapting existing filters is generally pretty trivial: you write a 10-line custom Handler that pipes it to an external process. This isn't big enough for a GSoC project.
- To the extent that new algorithms are involved, I have doubts that Mailman mentors have the kind of expertise needed to really help with such a project (I could be wrong, but I certainly don't know much about that kind of text processing, and I don't know that anybody else in Mailman has expertise in it).
Writing individual pipelines may be trivial, but making a user interface for managing said pipelines is non-trivial. Right now, our pipeline management interface is "there's a text box in postorius that lets you choose a pipeline. It's not even a dropdown, and you may be screwed if you make a typo" which is obviously not how I want it when we release. ;)
I see a potential project timeline going something like this:
A. make a set of custom Mailman 3 Handlers for some well-known existing anti-spam/anti-malware software. (Maybe 2-3 weeks of work here, finding 2-4 reasonable pieces of software, setting them up, writing the handlers, and testing them)
B. make an interface in Postorius so list admins can enable/disable/reorder these and any whitelisting happening within mailman. This should involve making an interface in Postorius that gives admins the ability to change the Pipeline being used, and will likely involve a small amount of user testing to make sure said interface doesn't have risk of disastrous results if the administrator does the wrong thing. (Another 3-4 weeks of work including user testing, unit tests, and documentation)
C. Figure out how to set up some sort of packager that can install handlers + antispam software so that the site admin has an easy way to set these up if requested. (Another 3-4 weeks of work, including testing any scripts on a few different OSes and extensive documentation)
D. If there's any time leftover, implement some clever new filter (and appropriate Handler) that makes use of the list information itself (e.g. subscriber list, archives, etc.) to make better spam decisions. (at this point, you've got maybe 2 weeks left in the GSoC timeline)
I think that constitutes enough useful-to-mailman work to justify the google funds, gets us some customizable spam filtering (which as you say, is a frequently requested feature), but doesn't turn us into something we're not. That's why anti-spam made this year's gsoc list even though we've always said "do it in the MTA" and I'm not about to change that policy in general.
Do feel free to disagree with me, of course, Stephen. Or complain that I'm using the lure of antispam to get someone to solve my user interface for pipelines problem, which I totally am. ;)
Terri
Terri Oda writes:
Writing individual pipelines may be trivial, but making a user interface for managing said pipelines is non-trivial. Right now, our pipeline management interface is "there's a text box in postorius that lets you choose a pipeline. It's not even a dropdown, and you may be screwed if you make a typo" which is obviously not how I want it when we release. ;)
That's a more general issue (oh, I see you noticed that! :-), and I have no problem with doing something about it -- indeed, I'd be more than happy to (co-)mentor it because I just *love* custom Handlers. Here's what I would do:
- Get the list of handlers active to the list.
- Append the list of inactive handlers from Mailman/Handlers (the site's list, not the distributed handlers).
- The UI is a table with rows containing a checkbox for "active handler" (the row should be greyed out if it's inactive), an ordinal (numerical), and the handler name (gold star for popping up a tooltip with a detailed description/docstring on mouseover).
- Users can either change the numbers (error checked for uniqueness), with a partial order on standard handlers -- if the partial order is violated (including a missing handler like "ToOutgoing") the user is warned; or (platinum star) drag the handlers into the order they like (with same checks on the partial order).
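The partial-order check described above could be sketched roughly as follows. The function name, the shape of the constraint list, and the specific constraints in the usage note are assumptions for illustration; Mailman does not ship this function.

```python
def check_pipeline_order(order, must_precede, required=()):
    """Return a list of warnings for a proposed handler order.

    order        -- handler names in the order the admin chose
    must_precede -- (before, after) pairs forming the partial order
    required     -- handlers that must be present (e.g. "ToOutgoing")
    """
    warnings = []
    seen = set()
    for name in order:
        if name in seen:
            warnings.append(f"duplicate handler: {name}")
        seen.add(name)
    for name in required:
        if name not in seen:
            warnings.append(f"missing required handler: {name}")
    # Index of each handler; only pairs where both are present are checked.
    position = {name: i for i, name in enumerate(order)}
    for before, after in must_precede:
        if (before in position and after in position
                and position[before] > position[after]):
            warnings.append(f"{before} must run before {after}")
    return warnings
```

For example, with the made-up constraints `[("SpamDetect", "CookHeaders"), ("CookHeaders", "ToOutgoing")]`, the order `["CookHeaders", "SpamDetect"]` would produce one warning for the violated pair and, if `"ToOutgoing"` is listed as required, one for the missing handler. The UI would show these warnings rather than silently rejecting the change.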
I see a potential project timeline going something like this:
A. make a set of custom Mailman 3 Handlers for some well-known existing anti-spam/anti-malware software. (Maybe 2-3 weeks of work here, finding 2-4 reasonable pieces of software, setting them up, writing the handlers, and testing them)
One week for that work, it's all in the FAQ already I suspect.
B. make an interface in Postorius so list admins can enable/disable/reorder these and any whitelisting happening within mailman. This should involve making an interface in Postorius that gives admins the ability to change the Pipeline being used, and will likely involve a small amount of user testing to make sure said interface doesn't have risk of disastrous results if the administrator does the wrong thing. (Another 3-4 weeks of work including user testing, unit tests, and documentation)
You think the design above will take more than two days (one to learn how to do D&D to reorder a list) to code, and 4 to document and test? (I'm assuming Mailman2 kinds of pipeline APIs are already available. If new REST API is needed, OK, 3 weeks total.)
C. Figure out how to set up some sort of packager that can install handlers + antispam software so that the site admin has an easy way to set these up if requested. (Another 3-4 weeks of work, including testing any scripts on a few different OSes and extensive documentation)
OK, yes, getting PyPI down for the Handlers themselves (while these *could* be delivered with Mailman, I think it would be more valuable to have a standard PyPI delivery protocol for 3rd party Handlers) will likely take that much time, and indeed one needs to deal with OS PMS.
Do feel free to disagree with me, of course, Stephen.
I am indeed a curmudgeon about the antispam stuff. I don't think the first release of Mailman 3 should contain an attractive nuisance like serious antispam in Mailman (vs antispam in the MTA). I'll try to keep such negative thinking to one paragraph per post, though. :-)
Or complain that I'm using the lure of antispam to get someone to solve my user interface for pipelines problem, which I totally am. ;)
While I do think that an initial implementation is probably a total of about 2 weeks worth of work, I suspect that one could riff on the theme (hi, Barry, like that metaphor?) for a couple more weeks, and robust disaster recovery (saving off the old pipeline and restoring looks simple enough, but Mr. Murphy is lurking, I'm sure -- in particular, if we're going to allow through-the-web pipelines, we need to guarantee that received mail will not get lost if the pipeline is horked) could account for a couple more weeks.
On Apr 11, 2013, at 01:52 PM, Terri Oda wrote:
Writing individual pipelines may be trivial, but making a user interface for managing said pipelines is non-trivial. Right now, our pipeline management interface is "there's a text box in postorius that lets you choose a pipeline. It's not even a dropdown, and you may be screwed if you make a typo" which is obviously not how I want it when we release. ;)
Remember that in MM3 you have two processing queues. Think of the first as the "moderation" process and the second as the "modification" process.
http://pythonhosted.org/mailman/src/mailman/docs/8-miles-high.html#basic-mes...
When the message is first accepted by the LMTP server, it is dumped into the incoming queue. The incoming runner will send the message through the rule chain for the target mailing list. (There are actually two of these chains for every mailing list - one for messages to the owner/moderator and one for messages posted to the list.) Rule chains are made up of individual links where each link usually contains a rule and an action. Rules only inspect the message, never modify it (though they can add metadata to the associated dictionary), and rules produce a binary decision. If the rule "hits", the action is taken. If the rule "misses", the next link in the chain is executed.
At the end of the rule chain is usually a "truth" rule which always hits, and its action is to jump to the "accept" chain, which essentially just dumps the message into the pipeline queue.
The pipeline runner only handles messages that have been approved for posting, so it doesn't have to make any moderation decisions. All it has to do is modify the message for posting. Again, there are two pipelines, one for owner messages and one for list posts, but pipelines are much simpler. They are just sequences of handlers. Each handler does whatever it wants to the message, and it can use, add, or delete information from the metadata dictionary to do this work. E.g. handlers can remove or add RFC 822 headers, filter content, add those informative footers, etc. Handlers also copy the messages to other queues, so that other runners can send the message to the archives, NNTP, and the outgoing SMTPd.
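To make the two stages concrete, here is a toy model of a rule chain followed by a pipeline. It is only a sketch of the idea as described above and is not Mailman's code: `Link`, `run_chain`, `run_pipeline`, and the sample rules and handler are hypothetical names, and real chains can also jump to other chains, which is reduced here to returning an action name.

```python
from dataclasses import dataclass
from email.message import Message
from typing import Callable, Optional

Rule = Callable[[Message, dict], bool]     # inspects only; may add metadata
Handler = Callable[[Message, dict], None]  # may modify the message


@dataclass
class Link:
    rule: Rule
    action: str  # e.g. "accept", "hold", "discard"


def run_chain(links: list[Link], msg: Message, msgdata: dict) -> Optional[str]:
    """Moderation stage: the first rule that hits decides the action."""
    for link in links:
        if link.rule(msg, msgdata):
            return link.action
    return None  # no rule hit; a final "truth" link normally prevents this


def run_pipeline(handlers: list[Handler], msg: Message, msgdata: dict) -> None:
    """Modification stage: every handler runs, in order."""
    for handler in handlers:
        handler(msg, msgdata)


def missing_subject(msg: Message, msgdata: dict) -> bool:
    hit = not msg.get("Subject")
    if hit:
        msgdata["hold_reason"] = "missing subject"  # metadata, not the message
    return hit


def truth(msg: Message, msgdata: dict) -> bool:
    return True  # always hits; used as the last link


def add_footer(msg: Message, msgdata: dict) -> None:
    footer = msgdata.get("footer", "-- \nposted via the list\n")
    msg.set_payload(msg.get_payload() + footer)
```

A chain such as `[Link(missing_subject, "hold"), Link(truth, "accept")]` holds subjectless posts and accepts everything else; only accepted messages would then be handed to `run_pipeline`.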
I mention all this to show that there are *lots* of ways the system can be configured, and I'm not sure all should be exposed via Postorius. Certainly the full power isn't something a list owner will usually want to be confronted with ;). Every chain and pipeline (as well as every rule and handler) has a unique name, and the choice of chain and pipeline is part of a list's style (i.e. initial setting), so that may be a better way to allow list owners some flexibility in setting up their lists.
Anyway, some things to keep in mind. :)
-Barry
Hi all! Thank you very much for awesome discussion here!
On Fri, Apr 12, 2013 at 1:22 AM, Terri Oda <terri@zone12.com> wrote:
On 13-04-11 10:44 AM, Stephen J. Turnbull wrote:
- Mailman is the wrong place to do filtering. It's equally effective, normally covers more messages, and is somewhat more efficient in resource usage to do it at the MTA.
- Any new algorithms **should** be made available at the MTA level where they can be best put to use by more people. This implies something that either plugs into existing filters (such as spamassassin) or MTAs (ie, milters) rather than a Handler.
- Adapting existing filters is generally pretty trivial: you write a 10-line custom Handler that pipes it to an external process. This isn't big enough for a GSoC project.
- To the extent that new algorithms are involved, I have doubts that Mailman mentors have the kind of expertise needed to really help with such a project (I could be wrong, but I certainly don't know much about that kind of text processing, and I don't know that anybody else in Mailman has expertise in it).
I agree.
Writing individual pipelines may be trivial, but making a user interface for managing said pipelines is non-trivial. Right now, our pipeline management interface is "there's a text box in postorius that lets you choose a pipeline. It's not even a dropdown, and you may be screwed if you make a typo" which is obviously not how I want it when we release. ;)
I see a potential project timeline going something like this:
A. make a set of custom Mailman 3 Handlers for some well-known existing anti-spam/anti-malware software. (Maybe 2-3 weeks of work here, finding 2-4 reasonable pieces of software, setting them up, writing the handlers, and testing them)
B. make an interface in Postorius so list admins can enable/disable/reorder these and any whitelisting happening within mailman. This should involve making an interface in Postorius that gives admins the ability to change the Pipeline being used, and will likely involve a small amount of user testing to make sure said interface doesn't have risk of disastrous results if the administrator does the wrong thing. (Another 3-4 weeks of work including user testing, unit tests, and documentation)
C. Figure out how to set up some sort of packager that can install handlers + antispam software so that the site admin has an easy way to set these up if requested. (Another 3-4 weeks of work, including testing any scripts on a few different OSes and extensive documentation)
D. If there's any time leftover, implement some clever new filter (and appropriate Handler) that makes use of the list information itself (e.g. subscriber list, archives, etc.) to make better spam decisions. (at this point, you've got maybe 2 weeks left in the GSoC timeline)
This really looks great! Almost what I actually expected from a project like this. But, like Stephen and Barry pointed out, I am unsure as to how far this comes under GSoC's purview.
I think that constitutes enough useful-to-mailman work to justify the google funds, gets us some customizable spam filtering (which as you say, is a frequently requested feature), but doesn't turn us into something we're not. That's why anti-spam made this year's gsoc list even though we've always said "do it in the MTA" and I'm not about to change that policy in general.
Do feel free to disagree with me, of course, Stephen. Or complain that I'm using the lure of antispam to get someone to solve my user interface for pipelines problem, which I totally am. ;)
Terri
Thanks for such a great timeline, Terri. I don't have issues with this. As Stephen and Barry said, I even liked the idea of having a MILTER interface at the LMTP level.
On an overall positive note, I am quite convinced that giving the list admin flexible options to choose from is a good thing (though, as Barry pointed out, why should everything be exposed to the admin via Postorius, when much of it may not interest the admin?). I believe this could make a nice GSoC project, but with so many spam filters that people are already acquainted with, I am not sure how many people would actually use this feature.
Also, I would like to hear more about two other ideas: the boilerplate stripper, and better content-filtering / handling of error messages. Boilerplate stripping is easy to understand. But can anyone elaborate on better content-filtering / handling of error messages? I strongly believe that boilerplate stripping will be a cool thing to have in Mailman, and obviously, who would not want to welcome better content-filtering / error handling techniques on board?
-- Yours Sincerely, Mora Sreyantha Chary, Computer Engineering '14, National Institute of Technology Karnataka, Surathkal, India 575 025
- Stephen J. Turnbull <stephen@xemacs.org>:
Sreyanth writes:
- Anti-spam / anti-abuse in Mailman.
A couple of people have mentioned anti-spam, and it's a frequently requested feature. Nevertheless, I don't think we should spend Google money and mentor time on it.
I concur.
- Mailman is the wrong place to do filtering. It's equally effective, normally covers more messages, and is somewhat more efficient in resource usage to do it at the MTA.
Spam-filtering is expensive. It should be done only once - at sender level and not for each recipient of a mailing list.
We could let Mailman do it when the mail enters, but what would be the gain? There's plenty of software out there that already knows how to battle spam.
Even worse! In some countries - take Germany for example - you either reject spam at SMTP session level, while the sending client is still there and will take notice, or you MUST deliver it - else you break the law because you took responsibility for transport but suppressed the message.
Mailman is part of a mail system, but I don't expect it will ever become the component that communicates directly with a remote (spam sending) client.
All the work to add an anti-spam feature in Mailman would be 'useless' to countries with laws as I described above.
BUT ...
I think it would be real nice to have a MILTER interface at LMTP server level to allow mail modification as required. Mailman runs in large environments and all the 'large organizations' I have worked with asked my team and me to customize how mail is processed. MILTER is a great interface to modify mail.
- Any new algorithms *should* be made available at the MTA level where they can be best put to use by more people. This implies something that either plugs into existing filters (such as spamassassin) or MTAs (ie, milters) rather than a Handler.
- Adapting existing filters is generally pretty trivial: you write a 10-line custom Handler that pipes it to an external process. This isn't big enough for a GSoC project.
- To the extent that new algorithms are involved, I have doubts that Mailman mentors have the kind of expertise needed to really help with such a project (I could be wrong, but I certainly don't know much about that kind of text processing, and I don't know that anybody else in Mailman has expertise in it).
On the other hand, I don't know which project in GSoC would be a better place for it. It's possible to argue that Mailman is a reasonable place for it, but IMHO we probably shouldn't.
I hate to stand in the way of someone who wants to contribute to OSS, but IMHO we shouldn't either.
Regarding anti-abuse, we would like to do something about problems like backscatter. However, I have to wonder how much *code* (vs *specification* and *design*) is needed for those problems. If the project is really spec-heavy, it's probably not really what Google has in mind (based on comments on the mentors' list, not on any official Google pronouncements, though).
Has anyone ever mentioned SNMP as a feature for Mailman?
p@rick
-- [*] sys4 AG
http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Joerg Heidrich
On Apr 12, 2013, at 08:28 AM, Patrick Ben Koetter wrote:
I think it would be real nice to have a MILTER interface at LMTP server level to allow mail modification as required. Mailman runs in large environments and all the 'large organizations' I have worked with asked my team and me to customize how mail is processed. MILTER is a great interface to modify mail.
Do you mean a hook in Mailman's LMTP server process? I thought about that in my previous message but decided not to mention it because it's not clear to me how performant Mailman's current smtpd-based (read: async) LMTP server is. What I mean is, I'm not sure how much additional work we want the LMTP server to do.
It would be cool if someone did some performance testing of the LMTP implementation, and it would be cool if someone tried to add some hooks into that server. It might also be interesting to look into alternative implementations. Another reason to push for getting Mailman 3 onto Python 3 would be the ability to leverage Guido's Tulip work for better async IO performance.
Has anyone ever mentioned SNMP as a feature for Mailman?
Nope, but that would be interesting too.
-Barry
- Barry Warsaw <barry@list.org>:
On Apr 12, 2013, at 08:28 AM, Patrick Ben Koetter wrote:
I think it would be real nice to have a MILTER interface at LMTP server level to allow mail modification as required. Mailman runs in large environments and all the 'large organizations' I have worked with asked my team and me to customize how mail is processed. MILTER is a great interface to modify mail.
Do you mean a hook in Mailman's LMTP server process? I thought about that in
Yes, I mean to hook MILTER capability into Mailman's LMTP server process.
my previous message but decided not to mention it because it's not clear to me how performant Mailman's current smtpd-based (read: async) LMTP server is.
It's not clear to me either, but now that you made me think about it I begin to ask myself how fast is fast enough, and I also ask myself whether we are dealing with a bogey (had to look this up, hope it fits) or are trying to address a real bottleneck. (I've experienced quite a few "problematic" situations in mail transport which turned out to be driven more by myth and oral history than by solid knowledge.)
I agree we should measure, just in order not to speculate, but let me send some thoughts ahead before we set out to test performance:
Input/output ratio on a mailing list system is 1:n. Performance requirements on the receiving side should be the least to worry about.
In most usage scenarios that come to my mind companies run an MLM as a supplement to their 'regular' mail system. Only a minor ratio of mail that enters the mail system is routed forward to the MLM (here: MM3 LMTP server).
At the moment (MM2) mail enters Mailman via a script that is called. Scripts are _a lot_ slower than a server process. My understanding is MM3 will have an LMTP server process. Any site that switches to MM3 should experience a performance boost on the receiving side.
It seems to me most people will be fine. Unfortunately I think "most people" will not need to use a MILTER, either.
What characterizes the remaining group:
- They run sites dedicated solely to mailing lists.
- They need special filtering (read: MILTER and other methods).
- They split load via clusters.
- They have their own development teams to customize and optimize software as required.
What I mean is, I'm not sure how much additional work we want the LMTP server to do.
How much should it be able to do at all? Do you collect and log statistics at the moment? Personally I like the "delays=0.04/0.01/0.05/0.1" entry in Postfix's log. Quote from postconf(5):
The format of the "delays=a/b/c/d" logging is as follows:
· a = time from message arrival to last active queue entry
· b = time from last active queue entry to connection setup
· c = time in connection setup, including DNS, EHLO and STARTTLS
· d = time in message transmission
-- $ man 5 postconf | less +/delay_logging_resolution_limit
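If Mailman's LMTP server were to log something similar, the four fields would be easy to pull apart again for statistics. A small sketch of parsing the Postfix-style entry follows; the field names simply paraphrase the a/b/c/d descriptions quoted above, and the sample log line in the usage note is made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class Delays(NamedTuple):
    arrival_to_active: float   # a
    active_to_connect: float   # b
    connection_setup: float    # c
    transmission: float        # d


# Matches "delays=a/b/c/d"; the plain "delay=" total is deliberately not matched.
_DELAYS_RE = re.compile(r"\bdelays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)")


def parse_delays(log_line: str) -> Optional[Delays]:
    """Extract the four delay components from one log line, if present."""
    match = _DELAYS_RE.search(log_line)
    if match is None:
        return None
    return Delays(*(float(part) for part in match.groups()))
```

Calling `parse_delays("... delay=0.2, delays=0.04/0.01/0.05/0.1, dsn=2.0.0 ...")` yields a `Delays` tuple whose `transmission` field is `0.1`; lines without a `delays=` entry yield `None`.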
It would be cool if someone did some performance testing of the LMTP implementation, and it would be cool if someone tried to add some hooks into that server. It might also be interesting to look into alternative implementations. Another reason to push for getting Mailman 3 onto Python 3 would be the ability to leverage Guido's Tulip work for better async IO performance.
I'm short on time to do performance testing myself, but I'll forward the request to my team members since we are doing tests at the moment anyway. Maybe someone finds time to squeeze LMTP server testing in.
My first idea would be to use either Postfix smtp-source (multi-threaded SMTP/LMTP test generator) or swaks (Swiss Army Knife for SMTP) <http://www.jetmore.org/john/code/swaks/> and create a wrapper around it that produces the load.
Has anyone ever mentioned SNMP as a feature for Mailman?
Nope, but that would be interesting too.
We (sys4) will contribute the MIB and monitoring server during development, if someone takes on the programming.
p@rick
- Barry Warsaw <barry@list.org>:
On Apr 12, 2013, at 08:28 AM, Patrick Ben Koetter wrote:
I think it would be real nice to have a MILTER interface at LMTP server level to allow mail modification as required. Mailman runs in large environments and all the 'large organizations' I have worked with asked my team and me to customize how mail is processed. MILTER is a great interface to modify mail.
Do you mean a hook in Mailman's LMTP server process? I thought about that in my previous message but decided not to mention it because it's not clear to me how performant Mailman's current smtpd-based (read: async) LMTP server is. What I mean is, I'm not sure how much additional work we want the LMTP server to do.
It would be cool if someone did some performance testing of the LMTP implementation, and it would be cool if someone tried to add some hooks into that server. It might also be interesting to look into alternative implementations. Another reason to push for getting Mailman 3 onto Python 3 would be the ability to leverage Guido's Tulip work for better async IO performance.
We did a quick test and blew 10,000 messages into Mailman 3's LMTP server. The hardware was/is a Pentium 2, 2 GB RAM machine with desktop discs - way below current server hardware.
It took the test 25 min. to submit all messages:
    real 25m10.041s
    user 0m4.872s
    sys 0m7.700s
That makes an average of
about 400 msg/min or 6.6 msg/sec
Robert, who did the tests, Ralf and I agree that this is "way enough" for LMTP server performance.
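As a quick check of that arithmetic, using only the message count and the wall-clock ("real") time reported above (the helper name `throughput` is just for this example):

```python
def throughput(messages: int, minutes: int, seconds: float):
    """Return (messages per minute, messages per second)."""
    elapsed = minutes * 60 + seconds  # total wall-clock seconds
    per_second = messages / elapsed
    return per_second * 60, per_second


per_min, per_sec = throughput(10_000, 25, 10.041)
print(f"{per_min:.0f} msg/min, {per_sec:.1f} msg/sec")  # 397 msg/min, 6.6 msg/sec
```

So the quoted figure of 400 msg/min is a slight round-up of roughly 397 msg/min.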
If we add a MILTER interface, the milter applications hooked into the LMTP server's receiving process will slow down the intake rate. The impact depends on what the specific application tests or what kind of modification it applies to the message. In general MILTERs are designed to work in memory only. No message will need to be written to a disc, which usually is the most expensive operation during mail processing.
At the moment we (at sys4.de) don't think it needs further testing, but we offer to test more if you see a reason to.
p@rick
On Apr 12, 2013, at 01:44 AM, Stephen J. Turnbull wrote:
A couple of people have mentioned anti-spam, and it's a frequently requested feature. Nevertheless, I don't think we should spend Google money and mentor time on it.
From the core's perspective, I tend to agree that there are some interesting things we'd like to add here, but it's probably not enough work to justify a GSoC slot. I'm not sure if additional ui work can pad that out.
I also agree that in general, we want to encourage sites to push anti-spam defenses into the MTA as much as possible. The counter argument is that we get plenty of requests from folks who have no control over their MTA and want to be able to configure Mailman to help reduce spam. I think the following avenues would be interesting to pursue.
Assume the MTA is doing filtering, and that messages will fall into three categories: known bad (these get dropped at the MTA), known good (these flow through), unsure. For the latter, the message will probably be marked in some way, e.g. a header with a spam score, and it would be good if Mailman has some facility (e.g. a rule) to parse that header and make disposition decisions based on that value. One thing Mailman can do that the MTA cannot is allow for human intervention for disposition.
Provide an option for messages to detour into spam filters like spamassassin during Mailman message processing. This probably means a rule which calls out to SA or equivalent, and stores the score in some metadata. A rule hit might mean that the message has a spam score higher than a threshold, in which case processing jumps to a chain which can discard, reject, or hold the message.
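A rough sketch of the first avenue, reading a score that the MTA's filter left in a header and turning it into a disposition, might look like this. The header name `X-Spam-Score`, the two thresholds, and the function names are all assumptions for illustration; which header a site's filter actually writes depends on its configuration, and this is not an existing Mailman rule.

```python
from email import message_from_string
from email.message import Message
from typing import Optional


def spam_score(msg: Message, header: str = "X-Spam-Score") -> Optional[float]:
    """Return the numeric score from the header, or None if absent/garbled."""
    raw = msg.get(header)
    if raw is None:
        return None
    try:
        return float(str(raw).strip())
    except ValueError:
        return None


def disposition(msg: Message, hold_at: float = 5.0,
                discard_at: float = 15.0) -> str:
    """Map the score to an action; unscored messages flow through."""
    score = spam_score(msg)
    if score is None:
        return "accept"
    if score >= discard_at:
        return "discard"
    if score >= hold_at:
        return "hold"   # the "unsure" band: leave it to a human moderator
    return "accept"


sample = message_from_string("Subject: hi\nX-Spam-Score: 7.2\n\nbody\n")
print(disposition(sample))  # hold
```

The "hold" band is where Mailman adds something the MTA cannot: a moderator gets to make the final call.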
Regarding anti-abuse, we would like to do something about problems like backscatter. However, I have to wonder how much *code* (vs *specification* and *design*) is needed for those problems. If the project is really spec-heavy, it's probably not really what Google has in mind (based on comments on the mentors' list, not on any official Google pronouncements, though).
Agreed.
-Barry
participants (5): Barry Warsaw, Patrick Ben Koetter, Sreyanth, Stephen J. Turnbull, Terri Oda