[Spambayes] Feature improvement idea - Grammar and spell check rating!
Tony Meyer
tameyer at ihug.co.nz
Wed Jul 19 23:41:43 CEST 2006
> At the risk of bloating this awesome piece of software, I submit
> that Grammar and Spell checking (at least localized English) would
> be a good way to easily identify illegitimate email.
Grammar checking is difficult, but various methods of generating
tokens based on spell-checking have been evaluated in the past, and
found to be ineffective. For example:
[ 817813 ] Consider bad spelling a sign of spam
http://sourceforge.net/tracker/index.php?func=detail&aid=817813&group_id=61702&atid=498106
I suspect that the problems with this include:
* Many people 'misspell' words in legitimate email (abbreviations,
slang, proper nouns, typos, and so on)
* Spam that tries to hide behind misspelled words is generally
already caught; it is other spam (e.g. image-based) that really
causes problems these days.
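
As a rough illustration of the first point above, here is a minimal sketch of a spelling-based token generator. It is not the code from the tracker item and not part of SpamBayes: the function name, the token format, the bucket size, and the tiny word list are all assumptions made up for this example. A real test would use a full dictionary and a real corpus.

```python
import re

# Hypothetical, deliberately tiny dictionary (an assumption for this sketch).
KNOWN_WORDS = frozenset(
    "the meeting is at noon see you there please bring report "
    "buy cheap pills now best price".split()
)

# Treat runs of letters, digits and apostrophes as words.
WORD_RE = re.compile(r"[a-z0-9']+")


def spelling_tokens(text, known_words=KNOWN_WORDS, bucket_size=0.25):
    """Yield one synthetic token describing the unknown-word rate of text."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return  # nothing to rate, so generate no token
    unknown = sum(1 for word in words if word not in known_words)
    ratio = unknown / len(words)
    # Round down to a coarse bucket so similar messages share a token.
    bucket = int(ratio / bucket_size) * bucket_size
    yield "spelling:unknown-ratio:%.2f" % bucket


ham = "c u at the mtg, pls bring the rpt"
spam = "buy che4p p1lls now, b3st pr1ce"
print(list(spelling_tokens(ham)))
print(list(spelling_tokens(spam)))
```

In this contrived case the abbreviated (but legitimate) message and the obfuscated spam fall into the same bucket and so produce the same token, which gives the classifier nothing to discriminate on.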
This is perhaps a more corpus-dependent feature than others - for
example, I suspect that on a primarily business-orientated email
stream the results would be somewhat better (since work email tends
to be better spelt, although there are certainly plenty of exceptions
to that rule).
I haven't done any tests, but my expectation would be that grammar
checking would be even worse, since few English-as-a-first-language
speakers have any idea of what correct English grammar is. (I expect
that, for example, comma splices and incomplete sentences would be
just as common in ham as in spam.)
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.