skip at pobox.com
Fri Dec 26 13:36:09 EST 2003
Dave> I keep getting quite a few spams which fit the descriptions below
Dave> (from NEWTRICKS.txt):
Dave> - Punctuation sometimes gets inserted in otherwise spammy words
Dave> or phrases, e.g.: "Ch-eck ou=t ou-r sel)ection _of grea)t R_X
Dave> -emgffj". It might be helpful to try stripping punctuation.
Dave> (Idea from Paul Sorenson)
Dave> - Similarly, some letters get replaced by numbers, e.g.:
Dave> "V1agra" instead of "Viagra". Mapping numbers to suitable
Dave> letters might help in some situations.
Dave> Since "this file is for ideas that have or have not yet been
Dave> tried", I'd love to know what constitutes "trying". Is there some
Dave> official testing procedure or corpus we can test against? I'd
Dave> like to know whether any change I make is worth proposing. Of
Dave> course I can try it on my own databases of Ham and Spam first...
I tried the first (eliding punctuation from words). From a testing
standpoint it turns out not to be all that useful, I think for a couple
of reasons:
* There are plenty of other spammy clues in such messages which are
sufficient to kick these messages into spam range. Most of this stuff
winds up scoring at 0.95 or above for me. If they don't score as spam
for you, train on a few and see how it does then.
* Training databases full of old-ish mail won't contain many of these
sorts of messages, so enabling punctuation removal won't change things.
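
For anyone who wants to try the two ideas on their own ham and spam
first, here is a minimal sketch of the preprocessing in Python. This is
not SpamBayes code and not what I tested. The function names
(strip_punctuation, map_digits_to_letters, normalize) and the
digit-to-letter table are made up for illustration, and the rule of
only remapping words that mix letters and digits is my assumption about
what "suitable" mapping might mean.

```python
import re
import string

# Hypothetical digit-to-letter table; which digits map to which letters
# is an assumption, not something taken from NEWTRICKS.txt.
DIGIT_MAP = str.maketrans({"0": "o", "1": "i", "3": "e",
                           "4": "a", "5": "s", "7": "t"})

# Character class matching any ASCII punctuation character.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punctuation(word):
    """Remove ASCII punctuation from one whitespace-delimited word."""
    return _PUNCT_RE.sub("", word)


def map_digits_to_letters(word):
    """Replace digits with look-alike letters, but only in words that
    mix letters and digits, so plain numbers such as years survive."""
    has_alpha = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    if has_alpha and has_digit:
        return word.translate(DIGIT_MAP)
    return word


def normalize(text):
    """Split on whitespace, strip punctuation, remap digits, lowercase."""
    tokens = []
    for word in text.split():
        word = strip_punctuation(word)
        if word:  # drop words that were nothing but punctuation
            tokens.append(map_digits_to_letters(word).lower())
    return tokens


if __name__ == "__main__":
    print(normalize('Ch-eck ou=t ou-r sel)ection _of grea)t R_X -emgffj'))
    print(normalize("V1agra"))
```

Feeding the normalized tokens to the classifier in addition to (or
instead of) the raw ones, and comparing results on the same ham and
spam, is one way to see whether the change buys anything for your mail.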