[spambayes-dev] RE: [Pydotorg] Re: Generating SB tokens based upon
information on the net
Tim Peters
tim.one at comcast.net
Thu Aug 5 06:00:35 CEST 2004
Lest anyone forget <wink>, SpamBayes was originally developed using a
python.org mail corpus as "ham", consisting of tens of thousands of
"blessed" tech mailing list msgs. Hundreds of those turned out to be false
negatives (spam sitting in the "ham"), and were cleaned from the corpus over
a period of months as SpamBayes got better at discovering them (the large
number of bogus "ham" really hurt at the start -- garbage in, garbage out).
The classifier achieved the fabled "four nines" accuracy on that traffic in
controlled tests, and showed no possible improvement remaining to be made
(there were no false negatives remaining, and the 3 to 9 false positives
remaining were technically ham but likely impossible for any useful system
to identify as ham -- like the one-time poster to comp.lang.python who
quoted an entire Nigerian scam spam with a one-line "this is a scam" comment
at the start).
SpamBayes doesn't need more info to do a stellar job on tech mailing list
traffic (more might make for a tiny improvement, measurable only in a
very-large-scale controlled test), but what it does need is ongoing
training. I don't know whether the latter is feasible.
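
To make "ongoing training" and "garbage in, garbage out" concrete, here is a
minimal sketch of an incrementally trained token classifier. It is a toy with
add-one smoothing, not the SpamBayes code or its scoring scheme; ToyClassifier,
learn, unlearn and spam_score are hypothetical names invented for this
illustration, and the sample messages are made up.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical incremental token classifier (NOT the SpamBayes API)."""

    def __init__(self):
        # Per-class count of messages containing each token.
        self.counts = {True: Counter(), False: Counter()}
        # Per-class count of trained messages (True = spam, False = ham).
        self.nmsgs = {True: 0, False: 0}

    def learn(self, tokens, is_spam):
        # Count each distinct token once per message.
        self.counts[is_spam].update(set(tokens))
        self.nmsgs[is_spam] += 1

    def unlearn(self, tokens, is_spam):
        # Undo an earlier learn() call made with the same tokens and label.
        self.counts[is_spam].subtract(set(tokens))
        self.nmsgs[is_spam] -= 1

    def spam_score(self, tokens):
        # Sum per-token log-odds (add-one smoothing), then squash to (0, 1).
        log_odds = 0.0
        for tok in set(tokens):
            p_spam = (self.counts[True][tok] + 1) / (self.nmsgs[True] + 2)
            p_ham = (self.counts[False][tok] + 1) / (self.nmsgs[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


clf = ToyClassifier()
clf.learn("python list comprehension question".split(), False)
clf.learn("urgent transfer million dollars".split(), True)

# A spam that was wrongly "blessed" as ham pollutes the training data.
bogus_ham = "urgent million dollars bank".split()
clf.learn(bogus_ham, False)
probe = "urgent million dollars".split()
before = clf.spam_score(probe)

# Cleaning the corpus: retract the bad label and retrain as spam.
clf.unlearn(bogus_ham, False)
clf.learn(bogus_ham, True)
after = clf.spam_score(probe)
```

In this toy, the probe's spam score goes up once the mislabeled message is
moved from ham to spam, which is the corpus-cleaning effect in miniature.
Ongoing training amounts to someone continuing to make learn/unlearn calls
like these as new traffic arrives.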