[Spambayes] tokenizing identical words
Rob Hooft
rob@hooft.net
Sun, 06 Oct 2002 08:10:12 +0200
I have only been following the tokenizer from a distance, but has anyone
tried using logarithmic tokens for multiple occurrences of a word?
So, a spam mentioning Nigeria a couple of times could result in "nigeria
nigeria:2 nigeria:4 nigeria:8" tokens. I can imagine that the:16 is not
going to mean a lot, but nigeria:4, as in this message, may quickly
result in a high spam score...
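
A minimal Python sketch of the idea, for illustration only. The function
name log_count_tokens and the lowercasing step are assumptions, not part of
the existing Spambayes tokenizer. It emits each distinct word once, followed
by word:2, word:4, ... for every power of two that does not exceed the
number of times the word occurs:

```python
from collections import Counter


def log_count_tokens(words):
    """Yield each distinct word once, then word:2, word:4, ... for every
    power of two that does not exceed the word's occurrence count.

    Hypothetical helper for illustration; not the Spambayes tokenizer.
    """
    # Lowercasing is an assumption made here so "Nigeria" and "nigeria" merge.
    counts = Counter(w.lower() for w in words)
    for word, n in counts.items():
        yield word
        threshold = 2
        while threshold <= n:
            yield f"{word}:{threshold}"
            threshold *= 2


if __name__ == "__main__":
    sample = ["Nigeria"] * 8 + ["the"]
    print(" ".join(log_count_tokens(sample)))
```

With this scheme a word seen n times adds only about log2(n) extra tokens,
so very frequent words do not flood the token stream.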
So: if you want to be removed, take your credit card, get rich quick,
pay $100000 and click here: http://123456789/ :-)
Rob
PS: In my ham corpus there is a message from someone sending a list of all
ISO country codes. In my spam corpus there is a spam that lists a lot of
countries where the company is selling stuff....
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/