I'm interested in contributing to GBayes.
I'm thinking of trying word stemming and adding other types of token indicators. How can I contribute?
Pretty soon, an SF project will be created (Barry has already put in the request). We'll gladly add you to the list of developers.
Btw, I have been saving up my spam for a year or so. I have 31,238 spam messages saved up now. I categorized them as spam by reading the subject, or by examining the body when in doubt. Probably about 10% of the corpus is duplicates. Some of the messages carry viruses, likely Klez.
I'd like to replicate Tim's test rig so I can compare my results with existing ones. My spam isn't in mbox format, but I can convert it.
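For the mbox conversion mentioned above, here is a minimal sketch using Python's standard `mailbox` module. It assumes the saved spam is stored as one raw message per file in a single directory, which the post does not actually say; the function name `files_to_mbox` and both path arguments are made up for illustration and are not part of GBayes or of Tim's test rig.

```python
import mailbox
from pathlib import Path


def files_to_mbox(src_dir, mbox_path):
    """Append every regular file in src_dir to an mbox file.

    Assumption: each file holds exactly one raw message.
    Returns the number of messages written.
    """
    # No locking here: we assume nothing else is writing to mbox_path.
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for path in sorted(Path(src_dir).iterdir()):
            if not path.is_file():
                continue
            # Read as bytes so unusual charsets and attachments are
            # passed through without any decoding step.
            box.add(path.read_bytes())
            count += 1
    finally:
        # close() flushes pending changes to disk.
        box.close()
    return count
```

Called as `files_to_mbox("spam_dir", "spam.mbox")`, this would produce one mbox file for the whole directory. Messages that lack a leading "From " line get a generic one from the `mailbox` module.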
If you can't wait for the SF project, you can find all the code in the Python CVS tree: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox...
I'm particularly interested in how to allow HTML-only messages through (to reduce false positives). I'm getting a lot of personal mail in that format, unfortunately.
You train it with an equal number of spam and non-spam ("ham") that you received. Just make sure the ham training messages contain enough representatives of the html-only mail.

--Guido van Rossum (home page: http://www.python.org/~guido/)
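The advice above can be sketched roughly as follows. This is only an illustration: `balanced_training_set` and `html_only_fraction` are hypothetical helpers, not GBayes functions. The sketch assumes the larger corpus is randomly down-sampled to the size of the smaller one, and that "html-only" means a message whose top-level content type is `text/html`.

```python
import random


def balanced_training_set(spam, ham, seed=None):
    """Return (spam_sample, ham_sample) with the same number of messages.

    Assumption: the larger corpus is randomly down-sampled to the size
    of the smaller one.
    """
    rng = random.Random(seed)
    n = min(len(spam), len(ham))
    return rng.sample(list(spam), n), rng.sample(list(ham), n)


def html_only_fraction(messages):
    """Fraction of email.message.Message objects that are text/html only.

    Assumption: "html-only" means the top-level content type is text/html
    (no multipart wrapper, no plain-text alternative).
    """
    if not messages:
        return 0.0
    hits = sum(1 for m in messages if m.get_content_type() == "text/html")
    return hits / len(messages)
```

Running `html_only_fraction` over the ham sample before training gives a quick check on whether html-only mail is represented at all. How large a share counts as "enough" is not stated in the thread.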