[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Barry A. Warsaw barry@python.org
Tue, 20 Aug 2002 17:23:12 -0400


>>>>> "SM" == Skip Montanaro <skip@pobox.com> writes:

    tim> Straight character n-grams are very appealing because they're
    tim> the simplest and most language-neutral; I didn't have any
    tim> luck with them over the weekend, but the size of my training
    tim> data was trivial.

    SM> Anybody up for pooling corpi (corpora?)?

I've got collections from python-dev, python-list, edu-sig,
mailman-developers, and zope3-dev, chopped at Feb 2002, which is
approximately when Greg installed SpamAssassin.  The collections are
not /all/ known good, but pretty close (they should be verified by hand).

The idea is to take some random subsets of these, cat them together
and use them as both training and test data, along with some
'net-available known spam collections.
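
A minimal sketch of that idea, assuming each collection has already
been loaded as a list of message strings.  The function name, its
parameters, and the choice to split the pooled sample into disjoint
training and test sets are all assumptions for illustration, not
anything in GBayes.py.

```python
import random


def pool_and_split(collections, per_collection, test_fraction=0.5, seed=0):
    """Sample from each collection, pool the samples, and split them.

    collections: dict mapping a (hypothetical) list name to a list of
    messages.  Returns a (train, test) pair of disjoint lists.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    pooled = []
    for messages in collections.values():
        # Take a random subset, or the whole list if it is small.
        k = min(per_collection, len(messages))
        pooled.extend(rng.sample(messages, k))
    # "cat them together", then shuffle so the lists are interleaved.
    rng.shuffle(pooled)
    cut = int(len(pooled) * (1 - test_fraction))
    return pooled[:cut], pooled[cut:]


if __name__ == "__main__":
    good = {
        "python-dev": ["dev message %d" % i for i in range(10)],
        "python-list": ["list message %d" % i for i in range(3)],
    }
    train, test = pool_and_split(good, per_collection=4)
    print(len(train), "training messages,", len(test), "test messages")
```

The same function could be run a second time over the known spam
collections, so that both the good and the spam pools get split the
same way.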

No more time to play with this today though...
-Barry