Graham's spam filter
Sean 'Shaleh' Perry
shalehperry at attbi.com
Thu Aug 22 09:06:44 CEST 2002
On Wednesday 21 August 2002 10:48 pm, Erik Max Francis wrote:
> As I said earlier, one blocking issue for me in actually putting the
> filter into practice is the lack of good corpora (one for spam, one for
> non-spam); I keep all mail I receive, but the "backups" that I have
> usually consist of all the email I've ever received. (I certainly have
> kept a lot of good mail, but of course I've deleted a lot more, so it's
> hard to know whether or not it would be useful.) Note that if, from now
> on, I did manage to keep a corpus of all good email I've received
> alongside all email (both good and bad), it would be easy to apply
> simple subtraction to determine the good and bad figures (which are
> needed by Graham's algorithm), but what I have now consists of only some
> good messages going back through time and all email I've ever received
> (good and bad) since I switched over to my new rule-based Python filter.
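The subtraction described above could look something like the sketch below. It assumes the token counts for "all mail" and "good mail" are already available as collections.Counter objects; the names and the sample counts are made up for illustration and are not taken from anyone's actual filter.

```python
from collections import Counter

# Hypothetical token counts; real ones would come from tokenizing the mail.
all_counts = Counter({"meeting": 5, "free": 9, "python": 7})
good_counts = Counter({"meeting": 5, "free": 1, "python": 7})

# Counter subtraction keeps only tokens whose resulting count is positive,
# so tokens that occur only in good mail vanish from the bad figures.
bad_counts = all_counts - good_counts

# The message totals are obtained the same way (hypothetical numbers).
n_all, n_good = 1400, 1000
n_bad = n_all - n_good
```

This only works if the good corpus is a strict subset of the "all mail" corpus over the same period, which is the condition the quoted text says does not hold yet.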
Since reading that article I have kept a spam folder and move all spam there
rather than deleting it. I now have 400 or so messages in that folder. That
should be a sufficient corpus, and it grows daily.
An interesting issue for me is the contents of the spam. Some 70% of my spam
is Asian, so there is a strong chance that any mail containing CJK words,
especially Korean, will be classified as spam.
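To illustrate why, here is a rough sketch of the per-token probability from Graham's article, transcribed from his Lisp into Python. The function name and the sample numbers are my own assumptions (only the 400 spam messages come from my folder; the good-mail total is invented).

```python
def spam_probability(good_count, bad_count, ngood, nbad):
    """Per-token spam probability, following Graham's article.

    good_count/bad_count: occurrences of the token in each corpus.
    ngood/nbad: number of messages in each corpus (must be > 0).
    Returns None for tokens seen too rarely to be scored.
    """
    g = 2 * good_count          # good occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:               # too little evidence, skip the token
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))   # clamp so no token is absolute proof


# A token that shows up in 280 spam messages and never in good mail
# (hypothetical counts for a common Korean word):
p_korean = spam_probability(0, 280, ngood=1000, nbad=400)
```

With no good occurrences at all, the token lands on the 0.99 ceiling, so a legitimate Korean message would be scored as spam unless the good corpus also contains some Korean text.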