More years ago than I care to remember I did a Masters thesis on incorporating time dependent query terms in search queries used for searching "News" feeds. Part of the thesis involved implementing a test system. One of the many steps involved in the processing was the removal (or ignoring of) punctuation before searching for search tokens. I draw your attention to the following extract from a Spam Clues report:

'beneficiary' 0.844828 0 1
'beneficiary.' 0.844828 0 1

I would argue that there is no difference between these two tokens and that the inclusion of the punctuation adds nothing to the process but in this instance is likely to give the token a lower score than may be appropriate.

I further draw your attention to the following extracts from the same Spam Clues report:

'+31633775038' 0.844828 0 1
'30%' 0.844828 0 1
'65%to' 0.844828 0 1
'7.5.430' 0.867197 4 2
'17/11/2006' 0.909938 1 2
'268.14.7/537' 0.909938 1 2
'5:56' 0.909938 1 2

While strings of numbers such as TCP/IP addresses may be useful in differentiating spam from ham, generally numbers, digits and amounts for currency are not good choices for tokens. In particular the above date '17/11/2006' and time '5:56' tokens can normally be considered to be random and are unlikely to be of any use in classifying spam/ham.

I also used a stop list of words which are so common that they are useless to index or use in search engines or other search indexes. Below are a number of instances of words which I believe are not appropriate tokens to use to differentiate between spam and ham emails.

'under' 0.814607 3 1
'its' 0.862812 1 1
'us.' 0.862812 1 1
'our' 0.611666 16 2
'when' 0.637817 7 1
'that' 0.664752 19 3
'all' 0.674394 12 2
'around' 0.739628 4 1
'it,' 0.848794 1 1
'up,' 0.848794 1 1
'p.m.' 0.813589 7 2
'does' 0.814607 3 1

Generally I find the current version of SpamBayes to be a very useful tool but I would like the ability to permanently set the value of a token i.e. I'd like to be able to set the token 'pharmacy' to value 1.0 to ensure that all emails containing it are classified as spam; likewise I'd like to classify certain terms as having value 0.0 so that they are always classified as ham.

Keep up the good work and I hope that my suggestions are worthwhile.

Regards
A.J. O'Neill
M. App. Sc. M.B. Computing Grad. Dip. K.B.S.
Hi AJ,
'beneficiary' 0.844828 0 1
'beneficiary.' 0.844828 0 1

I would argue that there is no difference between these two tokens and that the inclusion of the punctuation adds nothing to the process but in this instance is likely to give the token a lower score than may be appropriate.
I disagree. In this particular case the distinction may not be clear, but, in some instances, especially when the spammer is a non-native English speaker, the near-random placement of punctuation in inappropriate locations can correctly identify spam.
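To make the distinction concrete, here is a minimal sketch, assuming a tokenizer that only lowercases and splits on whitespace. The function names are illustrative and are not SpamBayes' API. It shows how trailing punctuation produces a separate token, and what stripping punctuation would merge:

```python
import string


def whitespace_tokens(text):
    """Lowercase and split on whitespace only; punctuation stays attached."""
    return text.lower().split()


def stripped_tokens(text):
    """Same split, but strip leading and trailing punctuation from each token."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    # Drop tokens that were nothing but punctuation.
    return [t for t in tokens if t]


sample = "You are the beneficiary. The beneficiary must reply"
raw = whitespace_tokens(sample)      # keeps 'beneficiary.' and 'beneficiary' apart
merged = stripped_tokens(sample)     # both occurrences become 'beneficiary'
```

With the first function the classifier can learn a separate probability for each form; with the second that signal is discarded before training ever sees it.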
While strings of numbers such as TCP/IP addresses may be useful in differentiating spam from ham, generally numbers, digits and amounts for currency are not good choices for tokens. In particular the above date '17/11/2006' and time '5:56' tokens can normally be considered to be random and are unlikely to be of any use in classifying spam/ham.
Yet again, tokenization for exclusion is not about "normalcy"; it's about distinction. Any token distinctive to a spam or ham message may help in determining a future message's value. For example, most chain letters have similar origins or have crossed certain paths that will leave timestamp tracks within the body of the message and in the header. Some spam munges date values to use tell-tale values (trust me, you'll know it when you see it), and quite often 419 spam will have distinct values within dollar figures, dates and times presented which SpamBayes can use to aid in the identification of unwanted email. Also, while all of these values taken individually may seem silly, they are NOT treated as singular processing values within the SpamBayes filtering system, but are joined with others to build a composite of all tokens. Completely whitelisting tokens or token structures that /really do/ appear more frequently in spam would only serve to hinder SpamBayes' effectiveness.
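As an illustration of how individual token values are joined into a composite, here is a simplified sketch modeled on chi-squared (Fisher-style) combining, which is my understanding of the general scheme SpamBayes uses. It is not the project's actual code: the names are mine, it assumes every probability lies strictly between 0 and 1, and it omits clue selection and underflow protection.

```python
import math


def chi2q(x2, v):
    """Chi-squared survival function for an even number of degrees of freedom."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into a single score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (s - h + 1.0) / 2.0


spammy = combine([0.9] * 5)      # several spammy tokens reinforce each other
hammy = combine([0.1] * 5)       # several hammy tokens do the same
balanced = combine([0.9, 0.1])   # opposing tokens cancel out
```

The point of the sketch is that no single token, such as a date or a time, decides the outcome; it only shifts the composite.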
I also used a stop list of words which are so common that they are useless to index or use in search engines or other search indexes. Below are a number of instances of words which I believe are not appropriate tokens to use to differentiate between spam and ham emails...
It appears that your concern is primarily with the shorter strings. There is a method of increasing the minimum token length - IIRC the process was detailed on this list about 3 months ago. You might consider taking those steps if you want to eliminate the shorter tokens altogether.
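For illustration only, the effect of a longer minimum token length can be sketched as a simple filter. This is not the procedure described on the list, which is not reproduced here; `drop_short_tokens` and `min_length` are hypothetical names.

```python
def drop_short_tokens(tokens, min_length=4):
    """Keep only tokens with at least min_length characters."""
    return [t for t in tokens if len(t) >= min_length]


# Tokens taken from the stop-word examples quoted above.
clues = ["under", "its", "us.", "our", "when", "that",
         "all", "around", "it,", "up,", "p.m.", "does"]
kept = drop_short_tokens(clues, min_length=4)
```

A length threshold removes 'its', 'us.', 'our', 'all', 'it,' and 'up,', but common words such as 'under', 'when' and 'that' survive, so it is not a substitute for a stop list.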
...I would like the ability to permanently set the value of a token i.e. I'd like to be able to set the token 'pharmacy' to value 1.0 to ensure that all emails containing it are classified as spam; likewise I'd like to classify certain terms as having value 0.0 so that they are always classified as ham.
I think strict "whitelist" and "blacklist" functionality would be quite useful, too. It might even help reduce my dependency on additional applications for spam processing.

Regards,
Shawn K. Hall
http://12PointDesign.com/
Mr. A.J. O'Neill wrote on Friday, November 17, 2006 7:25 PM -0500:
One of the many steps involved in the processing was the removal (or ignoring of) punctuation before searching for search tokens. I draw your attention to the following extract from a Spam Clues report
'beneficiary' 0.844828 0 1
'beneficiary.' 0.844828 0 1
I would argue that there is no difference between these two tokens and that the inclusion of the punctuation adds nothing to the process but in this instance is likely to give the token a lower score than may be appropriate.
This type of specific choice in the tokenizer resulted from testing in a number of people's working environments. It was shown empirically to improve classification. This suggests that the intuition behind your argument, which I originally shared, was not correct for the purpose of classifying email as ham or spam, at least on the mail tested at the time. A lot of the small choices in SpamBayes turn out to be the results of empirical testing rather than intuition, and it's surprising how often our intuition about our own language is incorrect.

If you're looking for a reason to explain the empirical results, one possibility is that keeping punctuation provides differentiation based on grammar, as opposed to just word occurrence. This is something that you normally don't get with a tokenizer that only recognizes words and not sentence structure.
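As a worked check on the quoted extract, the value 0.844828 for a token seen in one spam and no ham is what a Bayesian-adjusted estimate gives, assuming a prior of 0.5 weighted with strength 0.45. Those two constants are an assumption on my part, chosen because they reproduce the reported figure, and the training totals below are hypothetical (they cancel out when the ham count is zero).

```python
def adjusted_spamprob(ham_count, spam_count, nham, nspam,
                      prior=0.5, strength=0.45):
    """Estimate a token's spam probability, pulled toward `prior` when data is scarce."""
    n = ham_count + spam_count
    if n == 0:
        return prior  # never seen: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    return (strength * prior + n * raw) / (strength + n)


# Hypothetical training totals.
NHAM, NSPAM = 100, 100

separate = adjusted_spamprob(0, 1, NHAM, NSPAM)  # 'beneficiary' or 'beneficiary.' alone
merged = adjusted_spamprob(0, 2, NHAM, NSPAM)    # if both forms were counted as one token
```

Under these assumptions `separate` rounds to 0.844828 and `merged` to 0.908163, which is the "lower score" effect described in the original message. It only arises if the two forms were seen in two different spam messages; if both came from the same message, merging would leave the count at one and the score unchanged.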
I also used a stop list of words which are so common that they are useless to index or use in search engines or other search indexes. Below are a number of instances of words which I believe are not appropriate tokens to use to differentiate between spam and ham emails.
There is a clash between the philosophy of naive Bayesian classification and rule-based schemes. The idea behind rule-based schemes is that we can tap human beings' pattern recognition ability to create rules that we run in a computer. Since we can recognize spam easily when we see it, we are the best experts to consult when forming a rule set. The problem with this notion is that computers are not currently capable of creating inferences in the same way as people because the system architecture is so different. While people can indeed reliably distinguish spam, often from only a part of the message, they cannot reliably tell you how they made the decision.

The aim of naive Bayesian classification is to avoid all the particular problems of trying to construct a useful rule set and instead look at simple statistical properties of language that do not require human-like inference. The underlying model is fundamentally different. A Bayesian classifier is not trying to emulate a speaker of natural language.

The approach has strengths as well as weaknesses. One of the strengths is that you don't have to decide what words you think are the best or worst spam indicators. If you tend to favor rule-based approaches, this also looks like a huge weakness. The classifier learns word probabilities by observing your message classifications. To the extent that you are surprised by the spam probabilities of individual words, you would make the classifier worse by manually overriding the training results on a token-by-token basis. This happens far more often than you would think. Words that are about as likely to appear in spam as in ham score somewhere near 0.5 and do not contribute to the final score.

Another of the strengths is that the word probabilities vary widely among different recipients. It's a strength because there is no such thing as a ham word list that will reliably evade Bayesian classifiers.
That's also a weakness if you wish to apply Bayesian methods on a server without tracking the word probabilities separately for each mailbox. What this suggests is that it is equally difficult to come up with a list of words for the classifier to ignore that would work for most users.

There is a fundamental disagreement in the approaches of Bayesian and rule-based systems. Proponents of rule-based systems believe that people can best identify what clues are most significant, while proponents of Bayesian systems either believe that people cannot reliably identify the most important clues, or even if they can, they don't care to do so. The last condition is important if spam avoidance is simply a utilitarian goal, not a hobby.

Personally, I tried rule-based systems first and then experimented with SpamBayes. I found that my intuition on word probabilities was indeed wrong a significant proportion of the time and the naive Bayesian approach did about as well as my rule-based system when it was at its peak. The Bayesian approach required much less maintenance and it works well for a wide variety of end-users without requiring insight from them. I still feel there are very useful rules to help detect spam that are complementary to word frequency. These are things such as whether the message comes from a particular mailing list, whether the sending IP is on a DNS blacklist that I choose, or to which one of my mailbox addresses the message is addressed. My own compromise on this is to either put them in the domain MTA, or to write Outlook rules that run before the Bayesian classifier.

In terms of overall system architecture, I tend to believe that the rule-based approaches belong in the domain MTA, whenever possible, and should generate rejections during the SMTP session, preferably before DATA. This eliminates most of the spam at the lowest possible system cost and with the largest savings in bandwidth.
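As one example of a cheap rule that can run before DATA, a DNS blacklist check works by reversing the octets of the sending IPv4 address, appending the list's zone, and looking the name up: a listed address resolves, an unlisted one does not. The sketch below is a minimal illustration; the zone `dnsbl.example` is a placeholder, not a real list, and both function names are mine.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the lookup resolves; any resolution failure counts as not listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Placeholder zone and a documentation-range address; no network access needed here.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example")
```

Treating every resolver error as "not listed" is a deliberate simplification; a production MTA would normally distinguish a timeout from a definite negative answer.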
You can eliminate another significant amount of spam by running rule-based content filters, such as SpamAssassin, in the MTA. This is very expensive, so it is important to run it on as few messages as possible. This generates rejections at the end of DATA, which are still useful for legitimate messages that are improperly classified.

For the spam that slips through global rule-based systems, it then makes sense to do computationally intensive and user-specific content filtering like SpamBayes in the MUA. The spam load is hopefully reduced enough that the end-user doesn't mind scanning the junk folder for the occasional false positive.

--
Seth Goodman