Shawn K. Hall
shawn at 12pointdesign.com
Sat Nov 18 14:18:49 CET 2006
> 'beneficiary' 0.844828 0 1
> 'beneficiary.' 0.844828 0 1
> I would argue that there is no difference between these two
> tokens and that the inclusion of the punctuation adds nothing
> to the process but in this instance is likely to give the
> token a lower score than may be appropriate.
I disagree. In this particular case the distinction may not be clear,
but, in some instances, especially when the spammer is a non-native
English speaker, the near-random placement of punctuation in
inappropriate locations can correctly identify spam.
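
To illustrate the point, here is a minimal sketch (not SpamBayes' actual tokenizer) comparing a whitespace-only split, which keeps punctuation attached, with a split that strips it. The function names and the sample sentence are made up for illustration.

```python
import string


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to tokens."""
    return text.lower().split()


def stripped_tokens(text):
    """Same split, but strip leading/trailing punctuation from each token."""
    stripped = (t.strip(string.punctuation) for t in text.lower().split())
    # Drop tokens that consisted of punctuation only.
    return [t for t in stripped if t]


# Hypothetical sample with the kind of misplaced punctuation described above.
sample = "The beneficiary ,is your beneficiary."

kept = whitespace_tokens(sample)
lost = stripped_tokens(sample)
```

With the punctuation kept, 'beneficiary' and 'beneficiary.' remain separate tokens and the oddly placed ',is' survives as a clue of its own; with stripping, all of that collapses into ordinary words.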
> While strings of numbers such as TCP/IP addresses may be
> useful in differentiating spam from ham, generally numbers,
> digits and amounts for currency are not good choices for
> tokens. In particular the above date '17/11/2006' and time
> '5:56' tokens can normally be considered to be random and are
> unlikely to be of any use in classifying spam/ham.
Yet again, tokenization for exclusion is not about "normalcy"; it's
about distinction. Any token distinctive to a spam or ham message may
help in determining a future message's value. For example, most chain
letters have similar origins or have crossed certain paths that will
leave timestamp tracks within the body of the message and in the header.
Some spam munges date values to use tell-tale values (trust me, you'll
know it when you see it), and quite often 419 spam will have distinct
values within dollar figures, dates and times presented which SpamBayes
can use to aid in the identification of unwanted email.
Also, while all of these values taken individually may seem silly, they
are NOT treated as singular processing values within the SpamBayes
filtering system, but are joined with others to build a composite of all
tokens - completely whitelisting tokens or token structures that /really
do/ appear more frequently in spam would only serve to hinder SpamBayes'
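
As a rough illustration of how individual token scores feed a composite, here is a simplified naive-Bayes style combination. This is an assumption for the sake of example and is NOT the combining scheme SpamBayes itself uses; the function name and the clamping bounds are hypothetical.

```python
import math


def combine(probs, floor=0.01, ceiling=0.99):
    """Combine per-token spam probabilities into one message score.

    Simplified naive-Bayes combination, shown only as an illustration;
    it is not SpamBayes' own combining method.
    """
    # Clamp so that no single token can force the result to exactly 0 or 1.
    clamped = [min(max(p, floor), ceiling) for p in probs]
    spam = math.prod(clamped)
    ham = math.prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


# One mildly spammy token on its own leaves the score where it was ...
single = combine([0.844828])
# ... while several tokens leaning the same way reinforce each other.
several = combine([0.844828, 0.844828, 0.9])
```

The point is that a date, time or amount token only nudges the result; it matters when other tokens in the same message lean the same way.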
> I also used a stop list of words which are so common that
> they are useless to index or use in search engines or other
> search indexes. Below are a number of instances of words
> which I believe are not appropriate tokens to use to
> differentiate between spam and ham emails...
It appears that your concern is primarily with the shorter strings.
There is a method of increasing the minimum token length - IIRC the
process was detailed on this list about 3 months ago. You might consider
taking those steps if you want to eliminate the shorter tokens.
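
I don't have the exact steps to hand, so the following is only a generic sketch of the idea, not the SpamBayes procedure: filter the token stream by length before training or scoring. The function name and the `min_len` parameter are hypothetical.

```python
def drop_short_tokens(tokens, min_len=3):
    """Return only the tokens that are at least min_len characters long.

    min_len is a hypothetical knob for this sketch, not a SpamBayes option.
    """
    return [t for t in tokens if len(t) >= min_len]


tokens = ["a", "to", "the", "pharmacy"]
longer = drop_short_tokens(tokens, min_len=4)
```

Raising the threshold removes most stop-list style words in one step, at the cost of also dropping short tokens that might have been distinctive.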
> ...I would like the ability to permanently set the value of a
> token i.e. I'd like to be able to set the token 'pharmacy' to
> value 1.0 to ensure that all emails containing it are
> classified as spam; likewise I'd like to classify certain terms
> as having value 0.0 so that they are always classified as ham.
I think strict "whitelist" and "blacklist" functionality would be quite
useful, too. It might even help reduce my dependency on additional
applications for spam processing.
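
For what it's worth, such an override could sit in front of the normal scoring step. The sketch below assumes a scoring callable is available; `score_with_overrides`, `score_fn` and both token sets are hypothetical names, not part of SpamBayes.

```python
def score_with_overrides(tokens, score_fn, whitelist=frozenset(),
                         blacklist=frozenset()):
    """Return a spam score in [0.0, 1.0], honouring hard overrides first.

    score_fn is any callable taking a list of tokens and returning a
    score; it stands in for the normal statistical classifier.
    """
    present = set(tokens)
    # Whitelist is checked first so a false positive is the less likely error.
    if present & whitelist:
        return 0.0
    if present & blacklist:
        return 1.0
    return score_fn(tokens)


def neutral(tokens):
    """Placeholder classifier for the example: always 'unsure'."""
    return 0.5


spam_score = score_with_overrides(["cheap", "pharmacy"], neutral,
                                  blacklist={"pharmacy"})
```

Checking the whitelist before the blacklist is a design choice in this sketch: when a message contains both kinds of token, it errs toward delivering the mail rather than discarding it.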
Shawn K. Hall