On Wednesday 21 August 2002 10:48 pm, Erik Max Francis wrote:
> As I said earlier, one blocking issue for me in actually putting the
> filter into practice is the lack of good corpora (one for spam, one for
> non-spam); I keep all mail I receive, but the "backups" that I have
> usually consist of all the email I've ever received.  (I certainly have
> kept a lot of good mail, but of course I've deleted a lot more, so it's
> hard to know whether or not it would be useful.)  Note that if, from now
> on, I did manage to keep a corpus of all good email I've received
> alongside all email (both good and bad), it would be easy to apply
> simple subtraction to determine the good and bad figures (which are
> needed by Graham's algorithm), but what I have now consists of only some
> good messages going back through time and all email I've ever received
> (good and bad) since I switched over to my new rule-based Python filter.

Since reading that article I have kept a spam folder and moved all spam there
rather than deleting it.  I now have 400 or so messages in that folder.  That
should be a sufficient corpus, and it grows daily.
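
For anyone who only has an "all mail" archive plus a set of known-good
messages, the subtraction idea from the quoted message could be sketched
roughly as below.  This is only an illustration: the tokenizer is a crude
approximation of the one described in Graham's article, and the function
names and sample messages are made up for the example.

```python
from collections import Counter
import re

# Rough approximation of Graham's tokens: alphanumerics, dashes,
# apostrophes and dollar signs.  Case is folded here for simplicity.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def count_tokens(messages):
    """Count token occurrences across an iterable of message strings."""
    counts = Counter()
    for text in messages:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def bad_counts_by_subtraction(all_counts, good_counts):
    """Derive spam token counts as (all mail) minus (good mail).

    Counter subtraction drops any token whose result is zero or negative.
    """
    return all_counts - good_counts


# Hypothetical toy data: the good messages are a subset of all mail.
all_mail = ["free money now", "lunch at noon", "free lunch"]
good_mail = ["lunch at noon", "free lunch"]

bad = bad_counts_by_subtraction(count_tokens(all_mail),
                                count_tokens(good_mail))
```

With a dedicated spam folder the subtraction is unnecessary, since
count_tokens() can be run over the spam and good folders directly.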

An interesting issue for me is the contents of the spam.  Some 70% of my spam
is in Asian languages, so there is a strong chance that any mail containing
CJK words, especially Korean, will be classified as spam.
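
To make the concern concrete, here is a small sketch of the per-token
probability as I understand it from Graham's article (good counts doubled,
result clamped to the range 0.01 to 0.99, tokens seen fewer than five times
ignored).  The function name and the counts in the example are hypothetical;
only the figure of 400 spam messages comes from my own folder.

```python
def token_spam_probability(good, bad, ngood, nbad):
    """Per-token spam probability, following my reading of Graham's article.

    good, bad   -- occurrences of the token in the good and spam corpora
    ngood, nbad -- number of messages in each corpus (assumed non-zero)
    Returns None when the token is too rare to be scored.
    """
    g = 2 * good          # good occurrences are doubled to bias against false positives
    b = bad
    if g + b < 5:
        return None       # too little evidence for this token
    good_freq = min(1.0, g / ngood)
    bad_freq = min(1.0, b / nbad)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))


# A token seen in spam only (e.g. a common Korean word) hits the upper clamp.
spam_only = token_spam_probability(good=0, bad=40, ngood=1000, nbad=400)

# The same token also appearing in some good mail scores noticeably lower.
mixed = token_spam_probability(good=20, bad=40, ngood=1000, nbad=400)
```

So a correspondent writing in Korean would only stop looking like spam once
enough of their messages had landed in the good corpus.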

