[Spambayes] Feature improvement idea - Grammar and spell check rating!

Tony Meyer tameyer at ihug.co.nz
Wed Jul 19 23:41:43 CEST 2006


> At the risk of bloating this awesome piece of software, I submit  
> that Grammar and Spell checking (at least localized English) would  
> be a good way to easily identify illegitimate email.

Grammar checking is difficult, but various methods of generating  
tokens based on spell-checking have been evaluated in the past, and  
found to be ineffective.  For example:

[ 817813 ] Consider bad spelling a sign of spam
http://sourceforge.net/tracker/index.php?func=detail&aid=817813&group_id=61702&atid=498106

I suspect that the problems with this include:

  * Many people 'misspell' words in legitimate email (abbreviations,
    slang, proper nouns, typos, and so on).

  * Spam that tries to hide behind misspelled words is generally
    already caught; it is other spam (e.g. image-based) that really
    causes problems these days.

This is perhaps a more corpus-dependent feature than others - for  
example, I suspect that on a primarily business-orientated email  
stream the results would be somewhat better (since work email tends  
to be better spelt, although there are certainly plenty of exceptions  
to that rule).
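
To make the idea concrete, here is a minimal sketch of what a spell-checking token generator could look like. It is not the code from the tracker item above and not part of SpamBayes; the names spelling_tokens and KNOWN_WORDS, the token format, and the tiny word list are all made up for illustration. It counts the words that are missing from a dictionary and emits a single synthetic token for the (bucketed) proportion, which a classifier could then score like any other token.

```python
import re

# Hypothetical stand-in for a real dictionary; an actual experiment would
# load a full word list for the relevant language.
KNOWN_WORDS = frozenset(
    "the quick brown fox jumps over lazy dog please see attached report".split()
)

# Runs of letters and apostrophes; digits split words, so "ch3ap" becomes
# the two unknown fragments "ch" and "ap".
WORD_RE = re.compile(r"[a-z']+")


def spelling_tokens(text, known_words=KNOWN_WORDS, buckets=4):
    """Yield one synthetic token describing how much of the text is unknown.

    The token name and bucketing scheme are assumptions made for this
    sketch, not an existing SpamBayes convention.
    """
    words = WORD_RE.findall(text.lower())
    if not words:
        return
    unknown = sum(1 for word in words if word not in known_words)
    # Bucket the proportion so that the classifier sees only a handful of
    # distinct tokens rather than one token per exact ratio.
    bucket = min(unknown * buckets // len(words), buckets - 1)
    yield "spell:unknown-bucket:%d" % bucket
```

Note that a message with a single typo (say, "please see the atached report") lands in the same bucket as a perfectly spelt one, while a message full of names or slang lands in a high bucket, which is one way of seeing why such tokens may carry little information.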

I haven't done any tests, but my expectation would be that grammar  
checking would be even worse, since few English-as-a-first-language  
speakers have any idea of what correct English grammar is.  (I expect  
that, for example, comma splices and incomplete sentences would be  
just as common in ham as in spam.)

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.



