[Tutor] Why doesn't this regex match???

Tim Peters tim.one@comcast.net
Sat, 09 Feb 2002 17:02:35 -0500

[Sheila King]
> ...
> Then someone else suggested the regular expression (granted, someone who
> is rather adept at them, and is really a Perl coder who is just now
> learning Python). I'm not sure we gained much by it, but then I had
> thought to do this:
> take the entire list of spamphrases and form one big regex out of them
> and instead of looping through the list, simply do a single regex test
> on the subject line.

You're very close to getting that approach to work; I'm sure you can
complete it now.
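
A minimal sketch of that approach, assuming the spam phrases are plain
literal strings rather than patterns.  The name spamphrases follows your
description; the sample phrases and the helper names build_spam_regex and
is_spam are made up for illustration, so treat this as one way to do it
rather than the way:

```python
import re

# Hypothetical sample phrases; the real list would come from your filter.
spamphrases = ["make money fast", "free!!!", "act now"]

def build_spam_regex(phrases):
    """Combine literal phrases into one case-insensitive alternation."""
    if not phrases:
        # An empty alternation would match every subject line.
        raise ValueError("need at least one phrase")
    # re.escape keeps characters like '!' or '$' from being read as
    # regex syntax.
    alternatives = [re.escape(p) for p in phrases]
    return re.compile("|".join(alternatives), re.IGNORECASE)

spam_re = build_spam_regex(spamphrases)

def is_spam(subject):
    # search() scans the whole subject; match() only tests at the start.
    return spam_re.search(subject) is not None
```

Compiling once and reusing the compiled object means the cost of building
the big pattern is paid a single time, not once per message.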

> We are looking to run this filter on a community server that gets lots
> of emails to lots of different accounts and we expect a number of people
> will be using it. We don't want to have the sysadmin tell us to turn it
> off later, due to resource usage. So, I thought that a single regex test
> would be more efficient than looping through the list as we had done
> before.

You don't have to guess:  you can set up artificial tests and time them.  A
single regexp will likely be faster.  OTOH, if you accumulate enough
patterns, you may eventually bump into a limit on how large a regexp can be.
The latter isn't terribly likely with Python, but is very likely with some
other regexp packages.
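
One way such an artificial test might look, using the standard timeit
module.  The phrase list, the subject line, and the function names
loop_test and regex_test are all invented here, and the loop is assumed to
be a plain substring test, which may not be what your original loop did:

```python
import re
import timeit

# Hypothetical data, made up for the timing test.
spamphrases = ["phrase number %d" % i for i in range(200)]
# A subject that matches nothing, so both tests do the maximum work.
subject = "Re: minutes from Tuesday's meeting"

single_re = re.compile("|".join(map(re.escape, spamphrases)),
                       re.IGNORECASE)

def loop_test(subject):
    # Assumed shape of the original approach: one substring test per phrase.
    lowered = subject.lower()
    for phrase in spamphrases:
        if phrase in lowered:
            return True
    return False

def regex_test(subject):
    # One search against the combined pattern.
    return single_re.search(subject) is not None

# Each call below runs the test 2000 times and returns total seconds.
loop_time = timeit.timeit(lambda: loop_test(subject), number=2000)
regex_time = timeit.timeit(lambda: regex_test(subject), number=2000)
print("loop:  %.4f s" % loop_time)
print("regex: %.4f s" % regex_time)
```

Which one wins depends on the phrases, the subjects, and the Python
version, so run it against data that looks like your real traffic.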

> Comments on this logic not only welcome, but actually desired!!!

As a learning exercise, it's fine.  As a real-life spam filter, writing your
own is a dubious idea:  spammers and anti-spammers are in an escalating
technology war, and simple filters are increasingly ineffective (for any
defn. of "simple").  So if the goal is to tag spam instead of learning how
to write spam filters (there's no law that you can't do both, of course
<wink>), time would be better spent installing a state-of-the-art spam
filter.  For example,


is free for the taking.  It happens to be written in Perl, but I won't hold
that against it <wink>.  It's about half a megabyte of source code spread
over about 100 files, and I'm afraid that's what the state of the art has