[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
Barry A. Warsaw
barry@python.org
Tue, 20 Aug 2002 17:23:12 -0400
>>>>> "SM" == Skip Montanaro <skip@pobox.com> writes:
tim> Straight character n-grams are very appealing because they're
tim> the simplest and most language-neutral; I didn't have any
tim> luck with them over the weekend, but the size of my training
tim> data was trivial.
SM> Anybody up for pooling corpi (corpora?)?
I've got collections from python-dev, python-list, edu-sig,
mailman-developers, and zope3-dev, chopped at Feb 2002, which is
approximately when Greg installed SpamAssassin. The collections aren't
/all/ known good, but pretty close (they should be verified by hand).
The idea is to take some random subsets of these, cat them together
and use them as both training and test data, along with some
'net-available known spam collections.
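
A rough sketch of that pooling step, under some assumptions: each
collection is already loaded as a list of messages, and "both training
and test data" means a disjoint split of the pooled sample. The name
pool_and_split and its per_list, test_fraction, and seed parameters are
made up for illustration; nothing here is taken from GBayes.py.

```python
import random

def pool_and_split(collections, per_list, test_fraction=0.5, seed=0):
    """Pool a random subset of each collection, then split train/test.

    collections maps a list name to a list of messages (hypothetical
    layout). At most per_list messages are drawn from each list.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    pooled = []
    for msgs in collections.values():
        k = min(per_list, len(msgs))
        pooled.extend(rng.sample(msgs, k))  # sample without replacement
    rng.shuffle(pooled)  # mix the lists together before splitting
    cut = int(len(pooled) * (1 - test_fraction))
    return pooled[:cut], pooled[cut:]

# Toy stand-ins for real archives (placeholder strings, not real mail).
ham = {
    "python-dev": ["pd-%d" % i for i in range(10)],
    "edu-sig": ["es-%d" % i for i in range(4)],
}
train, test = pool_and_split(ham, per_list=6)
```

Real archives in mbox format could be read with the standard library's
mailbox.mbox class, and a spam collection could be pooled and split the
same way so that both halves contain both kinds of mail.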
No more time to play with this today though...
-Barry