Re: [Mailman-Users] privacy options, SPAM, regex
Helmut Schneider wrote:
I have lots of problems with out-of-office replies. I tried to set up a few filter rules using 2.1.10. Unfortunately they don't catch them. Are the expressions case sensitive? Are the expressions basic or extended? What I have tried so far:
^subject:.*Accepted.*
^subject:.*Declined.*
^subject:.*is out of office.*
There are two different filters under Privacy options... -> Spam filters, and they work differently.
The more flexible of the two is header_filter_rules. For header_filter_rules the regexps are matched against a multi-line string containing all the unfolded headers in the message, both message headers and sub-part headers. The regexp is a Python regexp <http://docs.python.org/library/re.html#regular-expression-syntax> and the headers are searched <http://docs.python.org/library/re.html#re.search> for a match of the regexp in MULTILINE and IGNORECASE mode. This means '^' matches at the beginning of the string and at the empty position immediately following each newline, and the match is case insensitive. Thus your above expressions look good.
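The matching behaviour described above can be reproduced outside Mailman. What follows is only a minimal sketch in current Python, not Mailman's code: the header text and addresses are made up, and only the regexp and the two flags come from the description.

```python
import re

# Hypothetical unfolded headers, one complete header per line.
headers = (
    "From: someone@example.com\n"
    "Subject: [Somelist] Declined: Invitation to workshop\n"
    "To: somelist@example.com\n"
)

pattern = r"^subject:.*Declined.*"

# MULTILINE lets '^' match after each newline; IGNORECASE lets
# 'subject' match 'Subject'.
match = re.search(pattern, headers, re.MULTILINE | re.IGNORECASE)
print(match.group(0) if match else None)

# Without MULTILINE, '^' matches only at the very start of the string,
# so a Subject: line that is not the first header is missed.
no_multiline = re.search(pattern, headers, re.IGNORECASE)
print(no_multiline)
```

With plain ASCII headers like these, the position of "Declined" within the subject makes no difference to the match.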
That's weird. Messages still pass with e.g.
Subject: [Somelist] Declined: Invitation to workshop on 13rd Dec. 2008
in the Header. Do I need to escape the colon? Or something else?
Interesting, with "^subject:.*Declined.*"
Subject: Declined: [Somelist] Invitation to workshop on 13rd Dec. 2008
matches while
Subject: [Somelist] Declined: Invitation to workshop on 13rd Dec. 2008
does not. Huh?!
Helmut Schneider wrote:
Interesting, with "^subject:.*Declined.*"
Subject: Declined: [Somelist] Invitation to workshop on 13rd Dec. 2008
matches while
Subject: [Somelist] Declined: Invitation to workshop on 13rd Dec. 2008
does not. Huh?!
It turns out that RFC 2047 encoded headers are not decoded before matching against the regexps. Is that the issue here? What do the raw headers look like?
I think that the headers should be decoded, but I wonder if people are currently working around this with regexps that match encoded headers and wouldn't match decoded headers.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
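To illustrate why an RFC 2047 encoded header can slip past such a regexp, here is a small sketch in current Python. The raw headers of the original messages are not shown in this thread, so the subject below and its base64 encoding are assumptions chosen only to demonstrate the effect.

```python
import base64
import re
from email.header import decode_header, make_header

# Hypothetical subject with a non-ASCII character, so a mail client
# might send it as a base64 encoded-word (an assumption for illustration).
subject = "[Somelist] Declined: Einladung f\u00fcr Workshop"
b64 = base64.b64encode(subject.encode("utf-8")).decode("ascii")
encoded = "=?utf-8?b?%s?=" % b64

pattern = re.compile(r"^subject:.*Declined.*", re.MULTILINE | re.IGNORECASE)

# The raw header contains only the encoded word, so the regexp finds nothing.
raw_header = "Subject: " + encoded
raw_match = pattern.search(raw_header)
print(raw_match)

# After RFC 2047 decoding, the same regexp matches.
decoded = str(make_header(decode_header(encoded)))
decoded_match = pattern.search("Subject: " + decoded)
print(decoded_match is not None)
```

Whether this is what happened with the messages above depends on what their raw headers actually look like.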
Mark Sapiro wrote:
It turns out that RFC 2047 encoded headers are not decoded before matching against the regexps. Is that the issue here? What do the raw headers look like?
I think that the headers should be decoded, but I wonder if people are currently working around this with regexps that match encoded headers and wouldn't match decoded headers.
I have developed a patch for SpamDetect.py which will decode RFC 2047 encoded headers. This is somewhat problematic because the decoded headers will presumably contain non-ascii characters, and while the character sets of the headers are known (and there can be different headers or even different parts of a single header encoded in different character sets), the character set of the regexps in header_filter_rules is not known.
The patch creates a unicode object containing all the headers unfolded and RFC 2047 decoded with one complete header per line and then encodes it into the character set of the list's preferred_language, and this result is what the regexps will search. As long as the regexps contain only ascii and the raw headers contain no non-ascii characters, this should give expected results. If the regexps contain non-ascii characters or the headers contain non-ascii not RFC 2047 encoded, results may be unexpected.
If in fact, the original issue is due to RFC 2047 encoded headers, try the patch and let us know how it works.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

--- f:/test-mailman-2.2/Mailman/Handlers/SpamDetect.py 2007-07-17 11:06:14.000000000 -0700
+++ f:/test-mailman/Mailman/Handlers/SpamDetect.py 2008-11-27 11:53:59.468750000 -0800
@@ -26,9 +26,8 @@
 """

 import re
-from cStringIO import StringIO

-from email.Generator import Generator
+from email.Header import decode_header

 from Mailman import mm_cfg
 from Mailman import Errors
@@ -60,34 +59,21 @@



-class Tee:
-    def __init__(self, outfp_a, outfp_b):
-        self._outfp_a = outfp_a
-        self._outfp_b = outfp_b
-
-    def write(self, s):
-        self._outfp_a.write(s)
-        self._outfp_b.write(s)
-
-
-# Class to capture the headers separate from the message body
-class HeaderGenerator(Generator):
-    def __init__(self, outfp, mangle_from_=True, maxheaderlen=78):
-        Generator.__init__(self, outfp, mangle_from_, maxheaderlen)
-        self._headertxt = ''
-
-    def _write_headers(self, msg):
-        sfp = StringIO()
-        oldfp = self._fp
-        self._fp = Tee(oldfp, sfp)
-        try:
-            Generator._write_headers(self, msg)
-        finally:
-            self._fp = oldfp
-        self._headertxt = sfp.getvalue()
+def getDecodedHeaders(msg, cset='utf-8'):
+    """Returns a string containing all the headers of msg, unfolded and
+    RFC 2047 decoded and encoded in cset.
+    """

-    def header_text(self):
-        return self._headertxt
+    headers = ''
+    for h, v in msg.items():
+        uvalue = u''
+        v = decode_header(re.sub('\n\s', ' ', v))
+        for frag, cs in v:
+            if not cs:
+                cs = 'us-ascii'
+            uvalue += unicode(frag, cs, 'replace')
+        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'replace'))
+    return headers



@@ -106,13 +92,10 @@
     # TK: Collect headers in sub-parts because attachment filename
     # extension may be a clue to possible virus/spam.
     headers = ''
+    # Get the character set of the lists preferred language for headers
+    cset = mm_cfg.LC_DESCRIPTIONS[mlist.preferred_language][1]
     for p in msg.walk():
-        g = HeaderGenerator(StringIO())
-        g.flatten(p)
-        headers += g.header_text()
-    # Now reshape headers (remove extra CR and connect multiline).
-    headers = re.sub('\n+', '\n', headers)
-    headers = re.sub('\n\s', ' ', headers)
+        headers += getDecodedHeaders(p, cset)
     for patterns, action, empty in mlist.header_filter_rules:
         if action == mm_cfg.DEFER:
             continue
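For readers who want to experiment with the idea outside a Mailman 2.1 installation, the following is a rough adaptation of the patch's approach to current Python. It is a sketch under assumptions, not the patch itself: get_decoded_headers is a hypothetical name, the sample message is invented, and the str/bytes handling differs from the Python 2 code above.

```python
import base64
import re
from email import message_from_string
from email.header import decode_header


def get_decoded_headers(msg, cset="utf-8"):
    """Return the headers of msg, unfolded and RFC 2047 decoded, one per line."""
    lines = []
    for name, value in msg.items():
        # Join folded continuation lines into a single line.
        unfolded = re.sub(r"\r?\n\s", " ", str(value))
        text = ""
        for frag, charset in decode_header(unfolded):
            # decode_header returns bytes for encoded words and str otherwise.
            if isinstance(frag, bytes):
                frag = frag.decode(charset or "us-ascii", "replace")
            text += frag
        # Round-trip through cset: characters it cannot represent become '?'.
        text = text.encode(cset, "replace").decode(cset)
        lines.append("%s: %s" % (name, text))
    return "\n".join(lines) + "\n"


# Hypothetical message with a base64 encoded subject and a folded header.
subject = "[Somelist] Declined: Einladung f\u00fcr Workshop"
b64 = base64.b64encode(subject.encode("utf-8")).decode("ascii")
raw = (
    "From: user@example.com\n"
    "Subject: =?utf-8?b?" + b64 + "?=\n"
    "X-Folded: first part\n"
    " second part\n"
    "\n"
    "Body text\n"
)
msg = message_from_string(raw)

# Collect the headers of the message and of any sub-parts, as the patch does.
headers = "".join(get_decoded_headers(part) for part in msg.walk())

match = re.search(r"^subject:.*Declined.*", headers,
                  re.MULTILINE | re.IGNORECASE)
print(match.group(0) if match else None)

# With an ASCII target character set, the non-ASCII character is replaced.
ascii_headers = get_decoded_headers(msg, "us-ascii")
print(ascii_headers)
```

The last call shows the caveat from the message above: once headers are re-encoded into a character set that cannot represent them, a regexp containing the original non-ASCII character would no longer match.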
From: "Mark Sapiro" <mark@msapiro.net>
Mark Sapiro wrote:
If in fact, the original issue is due to RFC 2047 encoded headers, try the patch and let us know how it works.
As far as I can see this patch works great. As a positive side effect, is it possible that this patch also affects uncaught bounces? I receive lots of uncaught bounces now where a SPAM-filter was required before the patch.
Thanks a lot, Helmut
Helmut Schneider wrote:
As far as I can see this patch works great. As a positive side effect, is it possible that this patch also affects uncaught bounces? I receive lots of uncaught bounces now where a SPAM-filter was required before the patch.
No. The patch has absolutely no effect on uncaught bounces. Uncaught bounces are messages sent to a LIST-bounces address that are not VERPed and are not recognized as DSNs. If spam is sent to a LIST-bounces address and makes it to Mailman, it will be an unrecognized bounce. SpamDetect.py and header_filter_rules are not involved at all in processing mail received at a LIST-bounces address.
Any change you observed in uncaught bounces is just a coincidence.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan