
Some perhaps relevant links (with no off-topic discussion):

 * http://www.tuxedo.org/~esr/bogofilter/
 * http://www.ai.mit.edu/~jrennie/ifile/
 * http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.df...

"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam." If you muddle together the Nigerian Pyramid schemes with the "Penis enhancement" ads along with the offers of new credit cards as well as the latest sites where you can talk to "hot, horny girls LIVE!", the statistics don't work out nearly so well.

It's hard to tell, on the face of it, why Nigerian scams _should_ be considered textually similar to phone sex ads, and in practice, the result of throwing them all together"

There are a few things left to improve about Ifile, and I'd like to redo it in some language fundamentally less painful to work with than C
"""

"Barry A. Warsaw" wrote:
"SM" == Skip Montanaro <skip@pobox.com> writes:
tim> Straight character n-grams are very appealing because they're
tim> the simplest and most language-neutral; I didn't have any
tim> luck with them over the weekend, but the size of my training
tim> data was trivial.
SM> Anybody up for pooling corpi (corpora?)?
I've got collections from python-dev, python-list, edu-sig, mailman-developers, and zope3-dev, chopped at Feb 2002, which is approximately when Greg installed SpamAssassin. The collections are /all/ known good, but pretty close (they should be verified by hand).
The idea is to take some random subsets of these, cat them together and use them as both training and test data, along with some 'net-available known spam collections.
No time more to play with this today though... -Barry
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
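For anyone who wants to try the pooling idea Barry describes, here is a rough sketch of one way it could go: draw a random subset from each known-good collection, cat the subsets together with a known-spam collection, shuffle, and split the result into training and test sets. Everything specific in it is an assumption for illustration, not anything from the thread: the name pool_corpora, the "ham"/"spam" labels, the per_list and test_fraction parameters, and the choice to pass messages in as plain strings rather than reading mailbox files.

```python
import random


def pool_corpora(ham_collections, spam_messages, per_list=100,
                 test_fraction=0.5, seed=0):
    """Sample from each ham collection, add spam, shuffle, and split.

    ham_collections: dict mapping a list name to a list of message strings.
    spam_messages:   list of message strings from a known-spam collection.
    Returns (train, test), each a list of (label, message) pairs.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    pooled = []
    for name in sorted(ham_collections):
        messages = ham_collections[name]
        # Take a random subset, or everything if the collection is small.
        k = min(per_list, len(messages))
        pooled.extend(("ham", msg) for msg in rng.sample(messages, k))
    # Hypothetical choice: use the whole spam collection rather than a sample.
    pooled.extend(("spam", msg) for msg in spam_messages)
    rng.shuffle(pooled)
    # Everything before the cut is training data, the rest is test data.
    cut = int(len(pooled) * (1 - test_fraction))
    return pooled[:cut], pooled[cut:]


if __name__ == "__main__":
    ham = {
        "python-dev": ["dev message %d" % i for i in range(20)],
        "python-list": ["list message %d" % i for i in range(20)],
    }
    spam = ["spam message %d" % i for i in range(10)]
    train, test = pool_corpora(ham, spam, per_list=5)
    print(len(train), "training messages,", len(test), "test messages")
```

Since the collections "should be verified by hand", the labels coming out of this are only as good as the inputs. Keeping the list name alongside each message would also make it possible to check whether a classifier does better on some lists than others.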
-- Paul Prescod