[Python-Dev] Getting started with GBayes testing
Guido van Rossum
guido@python.org
Wed, 04 Sep 2002 20:24:29 -0400
> I'm interested in contributing to GBayes ..
>
> I'm thinking of trying word stemming and adding other types of token
> indicators. How can I contribute?
Pretty soon, a SF project will be created (Barry has already gotten
the request in). We'll gladly add you to the list of developers.
> Btw, I have been saving up my spam for a year or so.. I have about
> 31,238 spam messages saved up now. These are categorized as spam
> based on my reading of the subject, or examining the body when in
> doubt. There are probably 10% dups in the corpus. Some of them have
> viruses, likely klez.
Cool.
> I'd like to replicate Tim's test rig so I can compare my results
> with existing ones. My spam isn't in mbox format, but I can convert
> it..
If you can't wait for the SF project, you can find all the code in the
Python CVS tree:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/spambayes/
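
For the mbox conversion mentioned above, here is a minimal sketch using
the `mailbox` and `email` modules from a current Python standard
library. It assumes the spam is stored one message per file in a single
directory; that layout, the function name `files_to_mbox`, and its
parameters are illustrative assumptions, not part of the spambayes code.

```python
import email
import mailbox
from pathlib import Path


def files_to_mbox(src_dir, mbox_path):
    """Append every regular file in src_dir to an mbox file.

    Assumes one RFC 2822 message per file (a hypothetical layout).
    Returns the number of messages written.
    """
    box = mailbox.mbox(mbox_path)  # created if it does not exist yet
    count = 0
    try:
        for path in sorted(Path(src_dir).iterdir()):
            if not path.is_file():
                continue
            # Parse from bytes so odd encodings in spam do not raise.
            msg = email.message_from_bytes(path.read_bytes())
            box.add(msg)
            count += 1
    finally:
        box.close()  # flushes pending changes to disk
    return count
```

Calling `files_to_mbox("spam_dir", "spam.mbox")` would write one mbox
entry per file; duplicates and virus-bearing messages are copied as-is.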
> I'm particularly interested in how to allow html only messages
> (reduce false positives). I'm getting a lot of personal mail in
> that format, unfortunately.
Train it on equal numbers of spam and non-spam ("ham") messages that
you have received. Just make sure the ham training messages contain
enough representatives of the html-only mail.
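
A rough sketch of that advice: draw equally sized spam and ham training
sets, then check what fraction of the ham is html-only. The helper names
(`is_html_only`, `balanced_training_set`, `html_only_share`) are
hypothetical, and "html-only" is assumed here to mean a message with a
text/html part and no text/plain part. How large a share counts as
"enough" is not specified above, so the code only reports it.

```python
import random


def is_html_only(msg):
    """True if msg has a text/html part and no text/plain part."""
    types = {part.get_content_type()
             for part in msg.walk() if not part.is_multipart()}
    return "text/html" in types and "text/plain" not in types


def balanced_training_set(spam, ham, seed=0):
    """Return equally sized random samples of spam and ham."""
    n = min(len(spam), len(ham))
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    return rng.sample(spam, n), rng.sample(ham, n)


def html_only_share(messages):
    """Fraction of messages that are html-only (0.0 if empty)."""
    if not messages:
        return 0.0
    return sum(is_html_only(m) for m in messages) / len(messages)
```

If `html_only_share` of the sampled ham comes out much lower than the
share in your real inbox, adding more html-only ham before training is
one way to follow the suggestion above.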
--Guido van Rossum (home page: http://www.python.org/~guido/)