[Spambayes] Latest spammer trick stymied
richard at jowsey.com
Mon Mar 31 21:14:56 EST 2003
Lately (as prophesied), there have been a number of very short spams
arriving, containing only a singleton URL. My proxy's classifier was
giving these an "unsure" rating -- too few clues. But these buggers
were starting to become quite annoying...
So today I added a simple web-crawler, which will venture out on
demand and slurp the words off any site. This little hoover is only
unleashed when the number of distinct clues/words in an email is less
than 150, it's heading for the "unsure" bucket, and we find an http
URL in there. The entire source HTML is then whacked through the
tokenizer and classified.
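
A rough sketch of that gating logic in Python, for illustration only (it is
not the proxy's actual code). The 150-clue threshold and the three conditions
come from the description above; the names tokenize, fetch_html and
tokens_for_scoring, the URL regex, and the 5-second timeout are all
assumptions made up for this example, and the toy tokenizer merely stands in
for a real one.

```python
import re
from urllib.request import urlopen

MAX_CLUES = 150  # threshold quoted in the post
URL_RE = re.compile(r"http://[^\s<>\"']+")  # assumed, deliberately simple


def tokenize(text):
    """Hypothetical stand-in tokenizer: the set of lowercased word-ish runs."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def fetch_html(url, timeout=5):
    """Download the raw page source; the timeout value is an assumption."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("latin-1", errors="replace")


def tokens_for_scoring(body, verdict, fetch=fetch_html):
    """Return the message's tokens, plus the linked page's when warranted.

    The crawl happens only if all three conditions hold: few distinct
    clues, a first-pass verdict of "unsure", and an http URL in the body.
    """
    tokens = tokenize(body)
    match = URL_RE.search(body)
    if len(tokens) < MAX_CLUES and verdict == "unsure" and match:
        try:
            tokens |= tokenize(fetch(match.group(0)))
        except OSError:
            # Network failure or timeout: fall back to the message alone.
            pass
    return tokens
```

The combined token set would then be handed back to the classifier for a
second score. Passing fetch as a parameter keeps the network call swappable,
which also makes the gating easy to exercise offline.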
The extra servlet processing can take a couple of seconds, mostly
network overhead, and is really only noticeable when paying close
attention to message download times, but the results are really worth
it! It nails them dead.