[Spambayes] Obvious Spam Missed...
bill at parducci.net
Wed Sep 17 13:23:13 EDT 2003
Skip Montanaro wrote:
> Tim> One thing to watch out for is that if you put "too much" data into
> Tim> the starter database, more additional training is needed to cater
> Tim> to personal quirks than if a new user starts with an empty
> If we do something like that, I think we should train on a very small set of
> mails, maybe no more than 20-30 of each class. After all, I think all we're
> trying to do is give the new user's incoming mail an initial nudge in the
> right direction. Ideally, the mails should be spread across domains,
> senders and recipients. If that's not possible, header clues which relate
> to the senders or recipients should be deleted before shipping.
since the skew can work both ways (should someone like tim include their
extracurricular activities in the ham training sample :o), wouldn't it
make sense to create a number of initial databases with *only* spam in
them and let the user train an appropriate amount of ham as part of the
install? anecdotal evidence suggests that just about everyone has some
ham lying around, yet not everyone keeps spam about.
by having a couple of sample dbs (e.g. 10 spam, 50 spam, 100 spam) you
could offer the value of kick starting a db, while reducing the
potential for skew.
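
a rough sketch of the idea, for illustration only. it uses a toy word-count filter, not the real Spambayes classifier or its database format; ToyFilter, make_starter and the sample messages are all made-up names and data. the starter db is built from spam only, in whatever size is wanted, and the user adds their own ham at install time.

```python
import math
from collections import Counter


class ToyFilter:
    """Toy word-count filter (hypothetical; NOT the Spambayes classifier)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.nmsgs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.nmsgs[label] += 1

    def spam_score(self, text):
        """Return a value in (0, 1); higher means more spam-like."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            # Add-one smoothing so an untrained class never gives zero.
            p_spam = (self.counts["spam"][tok] + 1) / (self.nmsgs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.nmsgs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def make_starter(spam_corpus, size):
    """Build a spam-only starter database from the first `size` messages."""
    starter = ToyFilter()
    for msg in spam_corpus[:size]:
        starter.train(msg, "spam")
    return starter


# Shipped side: starter databases of several sizes (the post suggests
# 10, 50 and 100 spam; tiny sizes here keep the example short).
spam_corpus = ["buy cheap pills now", "cheap mortgage offer now", "win money now"]
starters = {n: make_starter(spam_corpus, n) for n in (1, 2, 3)}

# Install side: the user picks a starter and trains their own ham into it.
db = starters[2]
for ham in ["python meeting notes attached", "lunch meeting tomorrow"]:
    db.train(ham, "ham")
```

since the shipped db holds no ham at all, nobody else's ham habits can leak into a new user's scores; the only skew left to worry about is how much spam the starter carries.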
just a thought...