[Spambayes] tokenizing identical words

Rob Hooft rob@hooft.net
Sun, 06 Oct 2002 08:10:12 +0200


I have only been following the tokenizer from a distance, but has anyone 
tried generating logarithmic count tokens for multiple occurrences of a 
word? A spam mentioning Nigeria eight or more times would then produce 
the tokens "nigeria nigeria:2 nigeria:4 nigeria:8". I can imagine that 
the:16 is not going to mean a lot, but nigeria:4, as in this message, 
may quickly result in a high spam score...
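
A minimal sketch of the idea, assuming the input is an already-split list 
of words. The function name log_count_tokens is hypothetical and is not 
part of the Spambayes tokenizer; it only illustrates emitting the bare 
word plus one "word:N" token for every power of two N that does not 
exceed the word's count.

```python
from collections import Counter


def log_count_tokens(words):
    """Yield each distinct word, plus 'word:N' for N = 2, 4, 8, ...

    A 'word:N' token is emitted for every power of two N that is less
    than or equal to the number of times the word occurs.
    Hypothetical helper for illustration only.
    """
    # Case-fold so that "Nigeria" and "nigeria" are counted together.
    counts = Counter(w.lower() for w in words)
    for word, n in counts.items():
        yield word  # the plain token, emitted once per distinct word
        threshold = 2
        while threshold <= n:
            yield f"{word}:{threshold}"
            threshold *= 2


if __name__ == "__main__":
    sample = ["Nigeria"] * 9 + ["the"]
    print(" ".join(log_count_tokens(sample)))
```

With this scheme the number of extra tokens per word grows only with the 
logarithm of its count, so a word repeated a hundred times adds six 
tokens rather than a hundred.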

So: if you want to be removed, take your credit card, get rich quick, 
pay $100000 and click here: http://123456789/ :-)

Rob

PS: In my ham corpus there is a message from someone sending a list of 
all ISO country codes. In my spam corpus there is a spam that lists a 
lot of countries where the company is selling stuff....

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/