[Spambayes] Obvious Spam Missed...

bill parducci bill at parducci.net
Wed Sep 17 13:23:13 EDT 2003


Skip Montanaro wrote:

> Tim> One thing to watch out for is that if you put "too much" data into
> Tim> the starter database, more additional training is needed to cater
> Tim> to personal quirks than if a new user starts with an empty
> Tim> database.
>
> If we do something like that, I think we should train on a very small set of
> mails, maybe no more than 20-30 of each class.  After all, I think all we're
> trying to do is give the new user's incoming mail an initial nudge in the
> right direction.  Ideally, the mails should be spread across domains,
> senders and recipients.  If that's not possible, header clues which relate
> to the senders or recipients should be deleted before shipping.

since the skew can work both ways (should someone like tim include their 
extracurricular activities in the ham training sample :o), wouldn't it 
make sense to create a number of initial databases with *only* spam in 
them and let the user train an appropriate amount of ham as part of the 
install? anecdotal evidence suggests that just about everyone has some 
ham lying around, yet not everyone keeps spam about.

by having a couple of sample dbs (e.g. 10 spam, 50 spam, 100 spam) you 
could offer the value of kick-starting a db, while reducing the 
potential for skew.
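
to make the idea concrete, here is a rough sketch of what i mean. it is a toy token counter, *not* the spambayes classifier or its api: the names TinyDB, seed_db and spam_prob are made up for illustration, and the add-one smoothing is just an assumption to keep the numbers sane. it builds a spam-only starter db of a chosen size, then lets the "user" train ham at install time.

```python
from collections import Counter


class TinyDB:
    """Toy token-count store (hypothetical; not the spambayes classifier)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        # Count each token at most once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spam_prob(self, token):
        # Per-token spam probability with add-one smoothing (an assumption).
        s = (self.spam[token] + 1) / (self.nspam + 2)
        h = (self.ham[token] + 1) / (self.nham + 2)
        return s / (s + h)


def seed_db(spam_msgs, size):
    """Build a starter db from the first `size` spam messages only."""
    db = TinyDB()
    for msg in spam_msgs[:size]:
        db.train(msg, is_spam=True)
    return db


if __name__ == "__main__":
    shipped_spam = ["buy cheap pills now", "cheap loans now", "win money now"]
    db = seed_db(shipped_spam, size=2)      # e.g. the "small" starter db

    # At install time the user supplies their own ham.
    for msg in ["meeting notes for tuesday", "lunch on tuesday"]:
        db.train(msg, is_spam=False)

    for tok in ("cheap", "tuesday"):
        print(tok, round(db.spam_prob(tok), 2))
```

the point is only that the shipped db carries no ham at all, so whatever ham quirks end up in it are the user's own.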

just a thought...

b



