[spambayes-dev] imbalance within ham or spam training sets?

Skip Montanaro skip at pobox.com
Mon Nov 3 15:47:04 EST 2003


    >> Let me rephrase the question again.  There's a discussion in Gary
    >> Robinson's LJ article
    >> 
    >> http://www.linuxjournal.com/article.php?sid=6467
    >> 
    >> about dealing with rare words which I didn't really follow.

    alex> It's talking about the math behind unknown_word_strength and
    alex> unknown_word_prob.

    >> If I've trained on 1000 other ham messages and now encounter a
    >> woodworking message, some of the words in there are likely to have
    >> not been seen before ("lathe", for example).  Such words obviously
    >> can't contribute to scoring that message.  Let's assume I then train
    >> that message as ham.  "lathe" now has a hamcount of 1 and a spamcount
    >> of 0.  It is a "rare word".  How many more messages which contain
    >> "lathe" do I have to train on before it is no longer "rare"?

    alex> A word is not "rare" or "not rare" according to the
    alex> classifier...

I understand that it's not a binary thing.  I used that term because Gary
used it in his article.
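
For reference, here is a minimal sketch of the rare-word adjustment as I read it from Gary's article, applied to the "lathe" example above. The function name and the training-set sizes are made up for illustration. The defaults s=0.45 and x=0.5 are what I believe unknown_word_strength and unknown_word_prob are set to, but treat them as assumptions. This is not the actual classifier code.

```python
def robinson_prob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Adjusted spam probability f(w) = (s*x + n*p(w)) / (s + n).

    s plays the role of unknown_word_strength and x of unknown_word_prob
    (default values here are assumptions, not read from the real options).
    """
    n = hamcount + spamcount          # number of trained messages containing the word
    if n == 0:
        return x                      # never-seen word: fall back to the prior
    # Use per-class ratios so unequal ham/spam corpus sizes don't skew p(w).
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    p = spamratio / (hamratio + spamratio)
    # Blend the observed p(w) with the prior x; the prior counts as s "virtual" messages.
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # Hypothetical corpus: ~1000 ham and 1000 spam, "lathe" seen only in ham.
    for k in (0, 1, 2, 5, 10):
        print(k, round(robinson_prob(k, 0, 1000 + k, 1000), 4))
```

With these assumed values, one ham sighting of "lathe" already moves it from 0.5 to roughly 0.155, and each further sighting pushes it closer to 0. There is no threshold at which the word stops being "rare". The prior's influence just shrinks smoothly as n grows.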

I seem to be having trouble making my ideas understood today...  Was my
exposition that vague?

    >> If there is a problem, it might be fairly easy to fall into a trap
    >> which is a bit difficult to get out of.

    alex> Lucky for us, there is no problem here. ;-)

That's all I was asking.

Skip


