[spambayes-dev] imbalance within ham or spam training sets?

Kenny Pitt kennypitt at hotmail.com
Mon Nov 3 13:17:16 EST 2003


Skip Montanaro wrote:
> Suppose I've trained on exactly 1000 ham and 1000 spam,
> just to eliminate that source of problems.  Within the 1000 hams,
> suppose I've trained on 800 python messages, 100 messages about cars
> and 100 messages about pop psychology.  We know that if I get a
> message about a subject which I've never trained on before (say,
> woodworking) that there are likely to be topic-specific clues I've
> never seen which won't contribute to scoring the message as ham
> ("router", "lathe", "sawdust", ...). 
> 
> Questions:
> 
>     * How many woodworking messages will I need to train as ham to
>       get the system to properly recognize those messages as ham? 
>       Would that large glut of python-related messages hamper the
>       ability of the classifier to detect woodworking messages as ham?

I would think a single message would be sufficient, assuming of course
that none of the words in your woodworking message already appear in your
*spam* training data.  SpamBayes only considers tokens that are *in* the
message
being classified, not tokens that are *not in* the message.  So,
regardless of how many times a token has appeared in the python
messages, it will not even be considered in the scoring if it does not
appear in the woodworking message.  On the other hand, if that token
*does* appear in the woodworking message then it will be solidly scored
as ham and therefore increase the probability of the message being
correctly classified.
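
To make that concrete, here is a rough sketch of per-token scoring that
only looks at tokens present in the message.  It is not the actual
SpamBayes code: the function names, the sample counts, and the smoothing
constants s and x are all assumptions chosen for illustration.

```python
# Illustrative sketch only; not the real SpamBayes implementation.
# s (prior strength) and x (prior probability) are assumed constants.

def token_spam_prob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Smoothed spam probability for a single token."""
    n = ham_count + spam_count
    if n == 0:
        return x  # never-seen token: neutral, contributes nothing
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward x when the token has been seen rarely.
    return (s * x + n * raw) / (s + n)

def message_clues(message_tokens, counts, n_ham, n_spam):
    """Score only the tokens that occur in the message being classified."""
    clues = {}
    for tok in set(message_tokens):
        ham_count, spam_count = counts.get(tok, (0, 0))
        clues[tok] = token_spam_prob(ham_count, spam_count, n_ham, n_spam)
    return clues

# Hypothetical training counts: token -> (ham_count, spam_count).
counts = {"python": (800, 0), "lathe": (1, 0), "mortgage": (0, 300)}

# A woodworking message: "python" is absent, so its 800 ham hits are ignored.
clues = message_clues(["lathe", "router", "lathe"], counts, 1000, 1000)
```

Under these assumptions "lathe", trained once as ham, comes out well
below 0.5, "router" stays at the neutral 0.5, and "python" never enters
the calculation at all.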

>     * Similarly, would the 8:1 ratio of python messages to messages
>       about cars or pop psychology have an effect on scoring any of
>       those messages accurately?

I wouldn't think so.  Since all of these messages are considered ham,
the tokens from the python messages would at best reinforce the
*correct* classification of the other messages, and at worst would
contribute nothing one way or the other to the scoring.
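
As a quick check on the 8:1 question, the sketch below scores one
hypothetical token per topic, assuming it appears in every ham message on
that topic and never in spam.  The smoothing formula and the constants s
and x are assumptions for illustration, not taken from the SpamBayes
source.

```python
# Illustrative sketch only; s and x are assumed smoothing constants.

def ham_only_prob(ham_count, s=0.45, x=0.5):
    """Smoothed spam probability for a token never seen in spam.

    The raw spam estimate for such a token is 0, so only the prior
    term s * x remains in the numerator.
    """
    return (s * x) / (s + ham_count)

# Hypothetical: one characteristic token per topic, counted once per message.
topic_counts = {"python": 800, "cars": 100, "pop psychology": 100}
probs = {topic: ham_only_prob(count) for topic, count in topic_counts.items()}
```

With these numbers every topic's token lands far below 0.5.  The python
token is somewhat more extreme, but the less common topics are still
strong ham clues on their own.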

Just my thoughts, totally unproven scientifically.

-- 
Kenny Pitt



