[Python-Dev] Getting started with GBayes testing

Guido van Rossum guido@python.org
Wed, 04 Sep 2002 20:24:29 -0400


> I'm interested in contributing to GBayes ..
> 
> I'm thinking of trying word stemming and adding other types of token
> indicators. How can I contribute?

Pretty soon, an SF project will be created (Barry has already gotten
the request in).  We'll gladly add you to the list of developers.

> Btw, I have been saving up my spam for a year or so.. I have about
> 31,238 spam messages saved up now. These are categorized as spam
> based on my reading of the subject, or examining the body when in
> doubt. There are probably 10% dups in the corpus. Some of them have
> viruses, likely klez.

Cool.

> I'd like to replicate Tim's test rig so I can compare my results
> with existing ones. My spam isn't in mbox format, but I can convert
> it..
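
For the mbox conversion, here is a minimal sketch using the standard
library's mailbox module.  It assumes each saved message is a separate
file holding one raw RFC 2822 message in a single directory; that
layout and the name files_to_mbox are hypothetical, not part of the
spambayes code.

```python
import mailbox
from pathlib import Path


def files_to_mbox(src_dir, mbox_path):
    """Append every regular file in src_dir to an mbox file.

    Assumes (hypothetically) one raw message per file.
    Returns the number of messages added.
    """
    box = mailbox.mbox(mbox_path)  # created if it does not exist
    count = 0
    try:
        # Sorted so the message order is reproducible between runs.
        for path in sorted(Path(src_dir).iterdir()):
            if path.is_file():
                box.add(path.read_bytes())
                count += 1
        box.flush()
    finally:
        box.close()
    return count
```

Keep the output file outside the source directory so that it is not
picked up as a message on a later run.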

If you can't wait for the SF project, you can find all the code in the
Python CVS tree:

  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/spambayes/

> I'm particularly interested in how to allow html only messages
> (reduce false positives).  I'm getting a lot of personal mail in
> that format, unfortunately.

You train it with an equal number of spam and non-spam ("ham")
messages that you received.  Just make sure the ham training messages
include enough representative examples of the html-only mail.
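
A rough sketch of how one might build such a training set, assuming
the messages are already loaded as lists of raw message strings.  The
helper names (balanced_sample, is_html_only, html_only_fraction) are
made up for illustration and are not from the spambayes code; the
"html-only" test here is just one plausible definition.

```python
import random
from email import message_from_string


def balanced_sample(spam, ham, seed=None):
    """Return (spam_sample, ham_sample), both of the same length."""
    rng = random.Random(seed)
    n = min(len(spam), len(ham))
    return rng.sample(spam, n), rng.sample(ham, n)


def is_html_only(raw):
    """True if the message has a text/html part and no text/plain part."""
    msg = message_from_string(raw)
    types = {part.get_content_type()
             for part in msg.walk() if not part.is_multipart()}
    return "text/html" in types and "text/plain" not in types


def html_only_fraction(messages):
    """Fraction of messages that are html-only (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(1 for m in messages if is_html_only(m)) / len(messages)
```

Checking html_only_fraction() on the sampled ham shows whether the
html-only mail is actually represented; if it is near zero, add more
of those messages to the ham set before training.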

--Guido van Rossum (home page: http://www.python.org/~guido/)