[spambayes-dev] imbalance within ham or spam training sets?
Skip Montanaro
skip at pobox.com
Mon Nov 3 14:09:04 EST 2003
>> * How many woodworking messages will I need to train as ham to get
>> the system to properly recognize those messages as ham? Would that
>> large glut of python-related messages hamper the ability of the
>> classifier to detect woodworking messages as ham?
Kenny> I would think one would be sufficient, assuming of course that
Kenny> none of the words in your woodworking message already appear in
Kenny> your *spam* training. SpamBayes only considers tokens that are
Kenny> *in* the message being classified, not tokens that are *not in*
Kenny> the message. So, regardless of how many times a token has
Kenny> appeared in the python messages, it will not even be considered
Kenny> in the scoring if it does not appear in the woodworking message.
Kenny> On the other hand, if that token *does* appear in the woodworking
Kenny> message then it will be solidly scored as ham and therefore
Kenny> increase the probability of the message being correctly
Kenny> classified.
Let me rephrase the question. There's a discussion in Gary Robinson's
Linux Journal article
http://www.linuxjournal.com/article.php?sid=6467
about dealing with rare words which I didn't really follow. If I've trained
on 1000 other ham messages and now encounter a woodworking message, some of
the words in it are likely not to have been seen before ("lathe", for
example). Such words obviously can't contribute to scoring that message.
Let's assume I then train that message as ham. "lathe" now has a hamcount
of 1 and a spamcount of 0. It is a "rare word". How many more messages
which contain "lathe" do I have to train on before it is no longer "rare"?
In particular, by training on 1000 other hams which don't contain that word,
have I somehow created an artificial barrier to getting woodworking-specific
words to have full effect as ham indicators?
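
For reference, here is a minimal sketch of the rare-word adjustment as it
appears in Robinson's article, f(w) = (s*x + n*p(w)) / (s + n), applied to
the "lathe" case. The function name is hypothetical, and the values s = 0.45
and x = 0.5 are assumptions, chosen because they are believed to match the
SpamBayes defaults. The way p(w) is computed from per-class ratios is also an
assumption about the classifier, not a quotation of its code.

```python
def robinson_prob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Robinson-adjusted spam probability f(w) for a single token.

    hamcount/spamcount: trained hams/spams containing the token.
    nham/nspam: total trained hams/spams (assumed to be > 0).
    s: strength given to the prior; x: assumed probability for unknown words.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never-seen token: only the prior is available
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)  # raw estimate, before adjustment
    return (s * x + n * p) / (s + n)


# "lathe" seen in 1, 2, 5 and 10 hams, never in spam, with 1001 hams trained.
for hams_with_lathe in (1, 2, 5, 10):
    f = robinson_prob(hams_with_lathe, 0, nham=1001, nspam=1000)
    print(hams_with_lathe, round(f, 4))
```

Under these assumptions a token with a spamcount of 0 has p(w) = 0 no matter
how large nham is, so the 1000 other hams do not enter the result at all.
Only n, the number of messages containing the token, pulls f(w) away from x.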
If there is a problem, it might be fairly easy to fall into a trap which is
a bit difficult to get out of. Suppose I'm starting from scratch and I know
I have several mailboxes:
* python - 800 messages
* cars - 100 messages
* pop-psychology - 100 messages
* spam - 1000 messages
As a new user, it might be very easy for me to ask SB to train on all
messages in the first three mailboxes as ham and all in the fourth as spam,
thus creating a problem (if one exists). *If* such a problem exists (and it
very well may not), it might be better if I could tell the system to pick a
random sample from each of my collections such that the number of hams and
spams is about equal and the imbalance among the mailboxes trained as ham
(or as spam) is not too great either.
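
The sampling idea could be sketched roughly as below. This is only an
illustration, not an existing SpamBayes feature: balanced_sample and its
arguments are hypothetical names, messages are stood in for by list items,
and capping every ham mailbox at the size of the smallest one is just one
possible policy.

```python
import random


def balanced_sample(ham_boxes, spam, seed=None):
    """Pick a training sample with equal-sized ham mailboxes and matching spam.

    ham_boxes: dict mapping mailbox name -> list of messages.
    spam: list of spam messages.
    Returns (dict of sampled ham per mailbox, list of sampled spam).
    """
    rng = random.Random(seed)
    # Cap every ham mailbox at the size of the smallest one.
    cap = min(len(msgs) for msgs in ham_boxes.values())
    ham = {name: rng.sample(msgs, cap) for name, msgs in ham_boxes.items()}
    # Draw about as many spams as hams, limited by what is available.
    n_ham = cap * len(ham_boxes)
    spam_sample = rng.sample(spam, min(n_ham, len(spam)))
    return ham, spam_sample


# The mailbox sizes from the example above, with integers as stand-in messages.
ham_boxes = {
    "python": list(range(800)),
    "cars": list(range(100)),
    "pop-psychology": list(range(100)),
}
ham, spam_sample = balanced_sample(ham_boxes, list(range(1000)), seed=0)
print({name: len(msgs) for name, msgs in ham.items()}, len(spam_sample))
```

With the mailbox sizes above this picks 100 messages from each ham mailbox
and 300 spams, at the cost of leaving 700 python messages untrained.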
Skip