[Spambayes] Mail classifiers, training sets and technical docs

Tim Peters tim.one at comcast.net
Mon Dec 30 12:50:47 EST 2002


[Anthony Baxter]
> A thought that occurs to me now - would it make more sense to instead
> provide a database seeded with a few obvious clues, rather than whole
> messages - for instance, start with a bunch of the standard "really
> really really bogus spam clues" from spamassassin?
>
> That way, people will hopefully start to get results immediately...
>
> Bah, brain foggy from too much Christmas, probably making no sense at
> all.

We ran tests "like that" before, based on a seed database derived from a
well-trained database, copying over only the words with very high spamprob
that had appeared "often" (so that their spamprobs are somewhat reliable).
The database then contains no words with spamprob < 0.5 (or, indeed, < 0.95,
if that's the "very high spamprob" cutoff used).  Predictably, that boosts
the false positive rate -- it's impossible for anything to score as ham,
unless ham_cutoff is also boosted above 0.5, so Unsure is the best realistic
classification you can hope for.  The seeded database recovers quickly
after training.  But then an empty database learns quickly too, and doesn't
have to fight off ghost spam <wink>.
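
To make the argument concrete, here is a minimal sketch of why a
spam-only seed can't produce a ham score.  It is not the code used in
those tests: build_seed(), score(), the 0.95 / 10-occurrence thresholds,
and the sample words are all made up for illustration, and a simplified
Graham-style product rule stands in for Spambayes's own combining scheme.
The point holds for any combiner that treats a missing word as neutral:
with no clue below 0.5 in the database, nothing can pull a score under 0.5.

```python
from math import prod

def build_seed(trained, spam_cutoff=0.95, min_count=10):
    """Copy only high-spamprob, frequently seen words into a seed database.

    `trained` maps word -> (spamprob, count).  Both thresholds are
    illustrative assumptions, not values from the original tests.
    """
    return {
        word: p
        for word, (p, count) in trained.items()
        if p >= spam_cutoff and count >= min_count
    }

def score(words, db, unknown_prob=0.5):
    """Combine per-word spamprobs with a simplified Graham-style rule.

    Words missing from `db` are ignored; a message with no known
    words gets the neutral score `unknown_prob`.
    """
    probs = [db[w] for w in set(words) if w in db]
    if not probs:
        return unknown_prob
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical "well-trained" database: word -> (spamprob, count).
trained = {
    "viagra": (0.99, 120),
    "unsubscribe": (0.96, 45),
    "winnner": (0.99, 2),      # spammy but too rare to trust
    "python": (0.02, 300),     # strong ham clue, dropped by the seed
    "meeting": (0.10, 80),     # ham clue, dropped by the seed
}

seed = build_seed(trained)

# An obviously hammy message: its ham clues are simply not in the seed.
no_clues = score(["python", "meeting", "tomorrow"], seed)

# The same message plus one mailing-list footer word.
one_spam_clue = score(["python", "meeting", "unsubscribe"], seed)

print(sorted(seed))
print(no_clues, one_spam_clue)
```

The first message can do no better than the neutral 0.5, and a single
seeded word drags the second one deep into spam territory even though
a fully trained database would have seen two strong ham clues in it.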



