Our corporate mail gateway adds a header to flag things it believes are spam. I'd like to be able to take advantage of this in my Mailman lists. I'm having some problems setting the header_filter_rules properly.
If I use a simple regex, like '^x-spam-flag: *yes', it seems to show up properly in the configuration. But if I use a more precise regex that accounts for all forms of whitespace with the "\s" sequence, like '^\s*x-spam-flag:\s*yes\s*$', then what appears in the config file is '^\\s*x-spam-flag:\\s*yes\\s*$'. In other words, the backslashes have been quoted and are now literals.
I have tried this through the web U/I and using config_list and I get the same result. It seems pretty clear that both interfaces are sanitizing inputs by quoting things that could cause problems, but I haven't dug deep enough to find where that's happening. (I don't really want to have to customize Mailman for something like this.)
I was surprised that I couldn't find any other mention of these kinds of problems, and that the only examples people were using to illustrate the use of regexes in Mailman config files didn't involve special sequences like "\s". Is anyone else using more sophisticated REs successfully? Is there some trick or Python arcana that I'm missing (I know barely enough Python to get _into_ a paper bag)?
I haven't verified, but I assume the same treatment is applied to any config field that can take a regex.
Thanks in advance for any help.
David E. Bernholdt | Email: bernholdtde@ornl.gov
Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
bernholdtde@ornl.gov wrote:
> If I use a simple regex, like '^x-spam-flag: *yes', it seems to show up properly in the configuration. But if I use a more precise regex that accounts for all forms of whitespace with the "\s" sequence, like '^\s*x-spam-flag:\s*yes\s*$', then what appears in the config file is '^\\s*x-spam-flag:\\s*yes\\s*$'. In other words, the backslashes have been quoted and are now literals.
Are you saying you see this with bin/dumpdb of the config.pck? If so, that's just the way Python is showing the representation of the string. It is not the actual value of the string. If you doubt that, try 'strings' instead of 'bin/dumpdb'.
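A small illustration of the representation point, as a sketch in plain Python rather than anything taken from Mailman: repr() doubles each backslash when it displays a string, while the string itself still holds single backslashes. The variable names are only for illustration.

```python
# The regex as entered in the list configuration. A raw string keeps
# each backslash as a single character.
pattern = r'^\s*x-spam-flag:\s*yes\s*$'

# repr() gives the "representation" form, which escapes each backslash.
shown = repr(pattern)

print(shown)    # representation: backslashes appear doubled
print(pattern)  # actual value: single backslashes

# Counting backslashes confirms that only the display differs.
actual_backslashes = pattern.count('\\')
shown_backslashes = shown.count('\\')
```

The stored pattern has three backslashes; only its printed representation has six.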
Also note the initial \s* in '^\s*x-spam-flag:\s*yes\s*$' really does nothing since if 'x-spam-flag:' is preceded by whitespace, it isn't a header. I.e., none of the headers presented to header_filter_rules regexps have leading whitespace.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On Tue, 19 Feb 2008 08:35:12 -0800 Mark Sapiro wrote:
> Are you saying you see this with bin/dumpdb of the config.pck? If so, that's just the way Python is showing the representation of the string. It is not the actual value of the string. If you doubt that, try 'strings' instead of 'bin/dumpdb'.
Ah, that seems to be it. dumpdb, config_list, and the web u/i all show the quoting of the backslashes, but strings on the pickle show it as I entered it. (Too bad such things don't "roundtrip" properly through tools that are supposed to do that, like config_list and the web u/i.)
So the next question is why don't the header_filter_rules appear to be working? Messages are getting held (which is what I'm doing for testing purposes), but the indicated reason is non-subscriber posting rather than the header filter. (Both conditions are true for the majority of junk that comes through, but there are some lists where I really do need to allow legitimate non-subscriber posts, with moderator approvals.)
As I understand it, SpamDetect runs before Hold, and I thought that the first exception kicked it out of the handler processing.
Is there some logging I can turn on to see more details as to what's going on in here?
Thanks
David E. Bernholdt | Email: bernholdtde@ornl.gov
Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
bernholdtde@ornl.gov wrote:
> On Tue, 19 Feb 2008 08:35:12 -0800 Mark Sapiro wrote:
>> Are you saying you see this with bin/dumpdb of the config.pck? If so, that's just the way Python is showing the representation of the string. It is not the actual value of the string. If you doubt that, try 'strings' instead of 'bin/dumpdb'.
> Ah, that seems to be it. dumpdb, config_list, and the web u/i all show the quoting of the backslashes, but strings on the pickle show it as I entered it. (Too bad such things don't "roundtrip" properly through tools that are supposed to do that, like config_list and the web u/i.)
What Mailman version is this? In my case, the web u/i shows the value I enter without doubling the '\'.
Also, the 'round trip' issue seems OK, at least for config_list -o followed by config_list -i. config_list reads the \\s and converts \\ to a literal \ so what gets put in the config is '\' followed by 's' which is exactly what you want. If you gave config_list something like
header_filter_rules = [('^x-header:\s+some value$', 0, False)]
it would interpret '\s' as a literal 's' and you would lose the '\'. config_list needs either
header_filter_rules = [('^x-header:\\s+some value$', 0, False)]
or
header_filter_rules = [(r'^x-header:\s+some value$', 0, False)]
Note that there is a big difference between '\s' in a string and, say, '\n'. '\s' is two characters, '\' and 's', whereas '\n' is a single newline character.
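The two recommended spellings can be compared directly. This is only a sketch in plain Python with illustrative variable names; it shows that the doubled-backslash form and the raw-string form produce the same value, and that backslash plus 's' is two characters while '\n' is one.

```python
# Two equivalent ways to write backslash + 's' in Python source.
escaped = '^x-header:\\s+some value$'  # backslash doubled in the source
raw = r'^x-header:\s+some value$'      # raw string, backslash left alone

same_value = (escaped == raw)

# Backslash followed by 's' is two characters; '\n' is one newline.
len_backslash_s = len('\\s')
len_newline = len('\n')

print(same_value, len_backslash_s, len_newline)
```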
> So the next question is why don't the header_filter_rules appear to be working? Messages are getting held (which is what I'm doing for testing purposes), but the indicated reason is non-subscriber posting rather than the header filter. (Both conditions are true for the majority of junk that comes through, but there are some lists where I really do need to allow legitimate non-subscriber posts, with moderator approvals.)
Good question. I think we need to determine why the web u/i is showing the doubled '\'. It shouldn't, and whatever is making it do so may also be the reason for the rules not matching.
> As I understand it, SpamDetect runs before Hold, and I thought that the first exception kicked it out of the handler processing.
That is correct.
> Is there some logging I can turn on to see more details as to what's going on in here?
Unfortunately, no. You have to actually code additional logging in the handler.
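For anyone who wants to see which rule fires, here is a rough standalone sketch of the kind of logging one might add. It is not Mailman's code and does not use Mailman's own logging: first_matching_rule, the logger name, and the sample headers are all hypothetical, and the rule tuples merely follow the shape of the header_filter_rules example above. It also assumes the search is done with re.IGNORECASE and re.MULTILINE.

```python
import logging
import re

log = logging.getLogger('header_filter_debug')  # hypothetical logger name


def first_matching_rule(rules, headers):
    """Return (pattern, action) for the first pattern that matches, else None.

    Each rule is a (patterns, action, flag) tuple; `patterns` may hold
    several regexes, one per line.
    """
    for patterns, action, _flag in rules:
        for pattern in patterns.splitlines():
            if not pattern.strip():
                continue  # skip empty lines
            hit = re.search(pattern, headers, re.IGNORECASE | re.MULTILINE)
            # Record every attempt so a non-matching rule is visible too.
            log.debug('pattern %r -> %s', pattern, 'match' if hit else 'no match')
            if hit:
                return pattern, action
    return None


logging.basicConfig(level=logging.DEBUG)
sample_headers = 'Subject: hello\nX-Spam-Flag: YES\n'
sample_rules = [(r'^x-spam-flag:\s+yes\s*$', 0, False)]
result = first_matching_rule(sample_rules, sample_headers)
```

In a real handler the same idea would have to be adapted to whatever logging facility the installed Mailman version provides.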
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Okay, it looks like the header_filter_rules are getting set correctly through all interfaces, including the web u/i -- I have been unable to reproduce the errors I had initially observed with escaping of backslashes. Perhaps I got my tests muddled.
I've also spent some time figuring out why even the simplest regex I could think of wasn't matching as I thought it should. I instrumented Handlers/SpamDetect.py to watch what was going on.
Here's the result: In applying header_filter_rules, it looks like the entire set of headers is being treated as a single multiline string.
For reasons I don't entirely understand (remember I'm not a python expert), "^" and "$" are not matching the beginning and end of individual lines of a multiline string, even though I interpreted http://www.python.org/doc/current/lib/matching-searching.html to say that they should, and a colleague who's very familiar with Python also thought they should.
If I don't have the line beginning/ending constraints in the regex, there is a risk (albeit small) that a subject header could match. So I ended up with "\nx-spam-flag:\s+yes\s*\n".
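A quick check of this workaround, as a sketch with made-up headers and assuming the search is done with re.IGNORECASE only (no re.MULTILINE). It shows the pattern catching a flagged message, ignoring a Subject line that merely contains the text, and, as a caveat, missing the header if it happens to be the very first line of the header block.

```python
import re

# The workaround: newline characters stand in for "^" and "$".
workaround = r'\nx-spam-flag:\s+yes\s*\n'


def matches(headers):
    # Assumption: a search without re.MULTILINE.
    return re.search(workaround, headers, re.IGNORECASE) is not None


# Made-up header blocks for illustration.
flagged = 'Received: from gw.example.org\nX-Spam-Flag: YES\nSubject: hi\n'
decoy = 'Received: from gw.example.org\nSubject: x-spam-flag: yes\n'
first_line = 'X-Spam-Flag: YES\nSubject: hi\n'

flagged_hit = matches(flagged)        # header on its own line
decoy_hit = matches(decoy)            # text only inside the Subject
first_line_hit = matches(first_line)  # no newline precedes the header

print(flagged_hit, decoy_hit, first_line_hit)
```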
By the way, MM FAQ entries 3.32 and 3.51 are inconsistent about "^" and "$", and based on my experience 3.51 is wrong.
Thanks for your help.
David E. Bernholdt | Email: bernholdtde@ornl.gov
Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
bernholdtde@ornl.gov wrote:
> Here's the result: In applying header_filter_rules, it looks like the entire set of headers is being treated as a single multiline string.
That is correct.
> For reasons I don't entirely understand (remember I'm not a python expert), "^" and "$" are not matching the beginning and end of individual lines of a multiline string, even though I interpreted http://www.python.org/doc/current/lib/matching-searching.html to say that they should, and a colleague who's very familiar with Python also thought they should.
It works for me. I just set a header_filter_rules regexp on a test list to "^subject:.*hello.*$" with a reject action, and my test post with
Subject: test Hello in subject
was rejected with "Message rejected by filter rule match". In case you're wondering, the Subject: was the 15th of 21 headers in the message delivered to Mailman.
I then tested the regexp "^subject:\s+yes\s*$" with a post with
Subject: yes
and it too was caught.
> If I don't have the line beginning/ending constraints in the regex, there is a risk (albeit small) that a subject header could match. So I ended up with "\nx-spam-flag:\s+yes\s*\n".
That should be equivalent to "^x-spam-flag:\s+yes\s*$" in header_filter_rules. Does the actual search in your SpamDetect.py say
if re.search(pattern, headers, re.IGNORECASE|re.MULTILINE):
> By the way, MM FAQ entries 3.32 and 3.51 are inconsistent about "^" and "$", and based on my experience 3.51 is wrong.
I have revised 3.32 which I think was the incorrect one, but we still have to resolve why it works for me and not for you.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Mark Sapiro wrote:
> bernholdtde@ornl.gov wrote:
>> Here's the result: In applying header_filter_rules, it looks like the entire set of headers is being treated as a single multiline string.
> That is correct.
>> For reasons I don't entirely understand (remember I'm not a python expert), "^" and "$" are not matching the beginning and end of individual lines of a multiline string, even though I interpreted http://www.python.org/doc/current/lib/matching-searching.html to say that they should, and a colleague who's very familiar with Python also thought they should.
> It works for me.
I suspect you have an old Mailman version. Prior to Mailman 2.1.7, the regular expression search was not multiline and therefore '^' and '$' would only match the beginning and end respectively of the string of all the headers.
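The effect of the multiline flag can be seen in isolation with Python's re module. This is a sketch with made-up headers, not the SpamDetect.py code itself; it assumes the older search passed only re.IGNORECASE, while the newer one adds re.MULTILINE as in the re.search call quoted earlier in the thread.

```python
import re

# Made-up header block; the header of interest is not on the first line.
headers = (
    'Received: from gw.example.org\n'
    'Subject: test\n'
    'X-Spam-Flag: YES\n'
    'Message-ID: <1@example.org>\n'
)
pattern = r'^x-spam-flag:\s+yes\s*$'

# Without re.MULTILINE, "^" matches only at the start of the whole string.
without_flag = re.search(pattern, headers, re.IGNORECASE)

# With re.MULTILINE, "^" and "$" also match at each line boundary.
with_flag = re.search(pattern, headers, re.IGNORECASE | re.MULTILINE)

print(without_flag)
print(with_flag.group(0))
```

On a version without the multiline search, a pattern anchored with "^" can therefore only ever match the first header in the block.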
I have updated FAQs 3.32 (again) and 3.51 to note the version dependence.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On Fri, 29 Feb 2008 19:53:15 -0800 Mark Sapiro wrote:
> I suspect you have an old Mailman version. Prior to Mailman 2.1.7, the regular expression search was not multiline
That's the answer: I'm using 2.1.5 from RHEL 4.
Thanks very much for your help with this!
David E. Bernholdt | Email: bernholdtde@ornl.gov
Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491