[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes
Tue, 20 Aug 2002 16:06:34 -0700
Some perhaps relevant links (with no off-topic discussion):
"""My finding is that it is _nowhere_ near sufficient to have two
populations, "spam" versus "not spam."
If you muddle together the Nigerian Pyramid schemes with the "Penis
enhancement" ads along with the offers of new credit cards as well as
the latest sites where you can talk to "hot, horny girls LIVE!", the
statistics don't work out nearly so well.
It's hard to tell, on the face of it, why Nigerian scams _should_ be
considered textually similar to phone sex ads, and in practice, the
result of throwing them all together"
There are a few things left to improve about Ifile, and I'd like to
redo it in some language fundamentally less painful to work with than
"Barry A. Warsaw" wrote:
> >>>>> "SM" == Skip Montanaro <email@example.com> writes:
> tim> Straight character n-grams are very appealing because they're
> tim> the simplest and most language-neutral; I didn't have any
> tim> luck with them over the weekend, but the size of my training
> tim> data was trivial.
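
For anyone who wants to try the character n-gram idea, here is a minimal
sketch of the feature extraction step only. The function name char_ngrams
and the default n=3 are my own choices for illustration, not anything
from the sandbox code, and this says nothing about how the counts would
then be scored.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in text.

    Hypothetical helper for illustration; no tokenizing or case folding
    is done, which is what keeps the approach language-neutral.
    """
    if n < 1:
        raise ValueError("n must be positive")
    # A string of length L has L - n + 1 overlapping n-grams (none if L < n).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


counts = char_ngrams("free money", n=3)
```

Spaces and punctuation are kept as ordinary characters, so an n-gram can
straddle a word boundary.
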
> SM> Anybody up for pooling corpi (corpora?)?
> I've got collections from python-dev, python-list, edu-sig,
> mailman-developers, and zope3-dev, chopped at Feb 2002, which is
> approximately when Greg installed SpamAssassin. The collections are
> /all/ known good, but pretty close (they should be verified by hand).
> The idea is to take some random subsets of these, cat them together
> and use them as both training and test data, along with some
> 'net-available known spam collections.
> No time more to play with this today though...
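
The pooling idea above could be sketched roughly as follows. This is only
an illustration: pool_and_split, per_list, and test_fraction are names I
made up, the messages are assumed to be already loaded as strings, and I
am assuming the training and test sets should be disjoint, which the
message above does not spell out.

```python
import random


def pool_and_split(collections, per_list, test_fraction=0.5, seed=0):
    """Draw a random subset from each list, pool them, and split the pool.

    collections maps a list name (e.g. "python-dev") to a list of
    messages. Returns (train, test) lists of (list_name, message) pairs.
    All names and parameters here are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    pooled = []
    for name, messages in collections.items():
        k = min(per_list, len(messages))
        pooled.extend((name, msg) for msg in rng.sample(messages, k))
    rng.shuffle(pooled)  # mix the lists together before cutting
    cut = int(len(pooled) * (1 - test_fraction))
    return pooled[:cut], pooled[cut:]


# Toy stand-ins for real archives.
ham = {
    "python-dev": ["a", "b", "c", "d"],
    "python-list": ["e", "f", "g", "h"],
}
train, test = pool_and_split(ham, per_list=2)
```

The same function could be run over the known-spam collections, with the
two results concatenated, so that both classes appear in training and in
test.
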