spam filtering messages containing certain 8 bit characters

Does Mailman base64 decode the subject before applying a regex, and if so, can I use UTF-8 character names in the regex to match various types of 8-bit characters?
Say, for example, that I want to block messages with "电话卡" somewhere in the subject line.
Obviously, the actual raw Subject header will be more like:
Subject: =?GB2312?B?[encoded stuff here]?=
Subject: =?utf-8?B?[encoded stuff here]?=
I tried putting in a regex to hold messages matching:
Subject: .*\u7535\u8bdd\u5361
And that didn't seem to work. As far as I can tell, there is no way to find a substring that will always match when the Subject header is base64 encoded.
(Putting in 'Subject: .*电话卡' also does not work).
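
To illustrate the symptom outside Mailman, here is a minimal sketch using only Python's standard library. The sample subject text is made up for the example, and the script only shows what a regex sees in a raw RFC 2047 header versus a decoded one; it says nothing about how Mailman itself processes headers.

```python
import base64
import re
from email.header import decode_header, make_header

# Hypothetical subject text, chosen only for this illustration.
subject = "便宜电话卡"

# Build an RFC 2047 "B" (base64) encoded-word the way a mail client might.
encoded = "=?utf-8?B?" + base64.b64encode(subject.encode("utf-8")).decode("ascii") + "?="
raw_header = "Subject: " + encoded

# The characters being searched for, written as Unicode escapes.
pattern = re.compile(r"\u7535\u8bdd\u5361")  # 电话卡

# Against the raw header the pattern only sees ASCII base64 text.
raw_match = pattern.search(raw_header)

# After decoding the encoded-word, the same pattern can find the characters.
decoded = str(make_header(decode_header(encoded)))
decoded_match = pattern.search(decoded)

print("raw header:    ", raw_header)
print("matches raw:   ", raw_match is not None)
print("decoded:       ", decoded)
print("matches decoded:", decoded_match is not None)
```

The pattern is identical in both searches; only the text it is applied to differs.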

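On 10/12/2011 6:58 PM, William Yardley wrote:
> Does Mailman base64 decode the subject before applying a regex, and if so, can I use UTF-8 character names in the regex to match various types of 8-bit characters?

No. header_filter_rules regexps are matched against the raw headers. If a header is RFC 2047 encoded, it is not decoded.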
> As far as I can tell, there is no way to find a substring that will always match when the Subject header is base64 encoded.

I think this is correct. Every 3 bytes of input are base64 encoded as a 4-character base64 substring. If the characters you are looking for encode to a multiple of 3 bytes and begin on a 3-byte boundary, they will always produce the same base64 string, but if they don't begin and end on a 3-byte boundary, the base64 substring will be affected by what comes before and/or after. Thus, I don't think you can reliably match, even if you are only dealing with a single character set.
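
The boundary effect described above can be seen with a short sketch in Python's standard library. The helper name b64 and the ASCII prefixes are made up for the illustration; the target string is the one from the original question.

```python
import base64

def b64(text, charset="utf-8"):
    """Base64-encode text after converting it to bytes in the given charset."""
    return base64.b64encode(text.encode(charset)).decode("ascii")

# 3 characters x 3 bytes each in UTF-8 = 9 bytes, i.e. three whole 3-byte groups.
target = "电话卡"
aligned = b64(target)

# Prefixes of 0 and 3 bytes keep the target on a 3-byte boundary;
# prefixes of 1 and 2 bytes shift it, which changes every base64 character.
for prefix in ("", "a", "ab", "abc"):
    encoded = b64(prefix + target)
    print(repr(prefix), encoded, "contains aligned form:", aligned in encoded)

# A different charset produces different bytes, hence a different base64 string.
print("gb2312:", b64(target, "gb2312"))
```

A fixed base64 substring therefore catches only the subjects in which the target happens to start on a 3-byte boundary in that particular charset.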
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
