[Mailman-Users] filtering based on message content
Mark Sapiro
mark at msapiro.net
Mon Jul 12 03:11:58 CEST 2010
Russell Clemings wrote:
>
>One other question: Is there an easy way to make it fire on parts of words
>as well as whole words? For example, I might want to catch "dig," "digger,"
>"digging," etc. (Not to mention "motherdigger.")
You can do pretty much any matching you want. For example
\b(mother)?dig(ger|ging)?\b
would match 'motherdig', 'motherdigger', 'motherdigging', 'dig',
'digger' or 'digging', but it wouldn't match 'diggery' because the \b
at the end of the regexp says "there must be a word boundary here"
where a word boundary is the beginning or end of the line or a
transition from the set of letters, digits and underscore to something
else, whereas
\b(mother)?dig(ger\w*|ging)?\b
would also match 'diggery' and 'diggers'. It gets somewhat tricky. You
could just match 'dig' regardless of what follows or precedes it with
the regexp
dig
but then you also match 'digest', 'indigent' and so forth. I know that
'dig' isn't actually the word you're targeting, but the same problem
exists with most simple words.
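To make the differences concrete, here is a small sketch that runs the three patterns above against a few test words. The word list and the helper name matching_words are made up for illustration, and the second pattern is written with \w* for the trailing characters, on the assumption that this is what was intended (\B is a zero-width assertion and can't be repeated).

```python
import re

# Hypothetical test words, chosen only to illustrate the discussion.
WORDS = ['dig', 'digger', 'digging', 'motherdigger',
         'diggery', 'diggers', 'digest', 'indigent']

PATTERNS = {
    # Whole words only.
    'strict': re.compile(r'\b(mother)?dig(ger|ging)?\b', re.IGNORECASE),
    # Also allows extra word characters after 'digger' (assumes \w*).
    'suffix': re.compile(r'\b(mother)?dig(ger\w*|ging)?\b', re.IGNORECASE),
    # Matches 'dig' anywhere, even inside other words.
    'bare': re.compile(r'dig', re.IGNORECASE),
}

def matching_words(pattern, words=WORDS):
    """Return the words in which pattern matches somewhere."""
    return [w for w in words if pattern.search(w)]

for name, pattern in PATTERNS.items():
    print(name, matching_words(pattern))
```

Only the bare 'dig' pattern picks up 'digest' and 'indigent', which is the over-matching problem described above.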
See <http://docs.python.org/library/re.html#regular-expression-syntax>
or perhaps <http://oreilly.com/catalog/9780596528126/>.
The original expression I gave you
BADWORDS = re.compile(r'(\W|^)word3(\W|$)|(\W|^)word6(\W|$)', re.I)
is a bit more complicated than it needs to be because (\W|^) and (\W|$)
could just as well be \b. Using the 'verbose' mode of regular
expressions that allows you to insert white space for readability, you
could have something like
BADWORDS = re.compile(r"""\bword3\b |
                          \bword6\b |
                          \b(mother)?dig(ger\w*|ging)\b
                       """, re.IGNORECASE | re.VERBOSE)
Then later you could decide to add \b(mother)?diggingest\b with minimal
editing like
BADWORDS = re.compile(r"""\bword3\b |
                          \bword6\b |
                          \b(mother)?diggingest\b |
                          \b(mother)?dig(ger\w*|ging)\b
                       """, re.IGNORECASE | re.VERBOSE)
Another way to do this is like
WORDLIST = [r'\bword3\b',
            r'\bword6\b',
            r'\b(mother)?diggingest\b',
            r'\b(mother)?dig(ger\w*|ging)\b',
           ]
BADWORDS = re.compile('|'.join(WORDLIST), re.IGNORECASE)
This just makes a list of simple regexps and then joins them with '|'
for the compiled re. In this case, re.VERBOSE isn't needed as we
introduce no insignificant white space.
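As a rough check that the two approaches behave the same, the sketch below compiles both forms and runs them over a made-up sample sentence. The helper find_bad_words and the sample text are hypothetical, not part of Mailman, and the pattern again assumes \w* for the optional trailing characters.

```python
import re

# List form: each entry is a complete regexp, joined with '|' below.
WORDLIST = [r'\bword3\b',
            r'\bword6\b',
            r'\b(mother)?diggingest\b',
            r'\b(mother)?dig(ger\w*|ging)\b',
           ]
JOINED = re.compile('|'.join(WORDLIST), re.IGNORECASE)

# Verbose form: white space inside the pattern is ignored.
VERBOSE = re.compile(r"""\bword3\b |
                         \bword6\b |
                         \b(mother)?diggingest\b |
                         \b(mother)?dig(ger\w*|ging)\b
                      """, re.IGNORECASE | re.VERBOSE)

def find_bad_words(text, pattern=JOINED):
    """Return every substring of text matched by pattern, in order."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = 'The Diggers say word3 is the motherdiggingest word.'
print(find_bad_words(sample))
print(find_bad_words(sample, VERBOSE))
```

Joining with '|' works without extra parentheses because alternation has the lowest precedence in a regexp, so each list entry stays a separate alternative.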
--
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan