[Spambayes] Latest spammer trick stymied

Richard Jowsey richard at jowsey.com
Mon Mar 31 21:14:56 EST 2003

Lately (as prophesied), there have been a number of very short spams 
arriving, containing nothing but a single URL. My proxy's classifier was 
giving these an "unsure" rating -- too few clues. But these buggers 
were starting to become quite annoying...

So today I added a simple web-crawler, which will venture out on 
demand and slurp the words off the linked site. This little hoover is 
only unleashed when three things are all true: the email has fewer 
than 150 distinct clues/words, it's heading for the "unsure" bucket, 
and we find an http URL in there. The page's entire source HTML is 
then whacked through the tokenizer and classified.
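
For anyone wanting to try the same trick, here is a rough Python sketch 
of the trigger logic described above. It is not the servlet code from 
my proxy: the names (tokenize, find_http_url, fetch_html, 
tokens_for_classifier), the "unsure" cutoffs, and the toy tokenizer are 
all made up for illustration. Only the 150-clue threshold and the three 
conditions come from the description. Whether the page's tokens should 
be added to the email's tokens or classified on their own is also a 
guess; this version appends them.

```python
import re
from urllib.request import urlopen

MIN_DISTINCT_CLUES = 150      # threshold mentioned in the post
UNSURE_RANGE = (0.2, 0.9)     # ASSUMED cutoffs; the post does not give them

URL_RE = re.compile(r"http://[^\s<>]+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercased word-like runs."""
    return [w.lower() for w in WORD_RE.findall(text)]


def find_http_url(body):
    """Return the first http URL in the message body, or None."""
    match = URL_RE.search(body)
    return match.group(0) if match else None


def fetch_html(url, timeout=5):
    """Fetch the raw page source; latin-1 never fails to decode."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("latin-1")


def tokens_for_classifier(body, score, fetch=fetch_html):
    """Return the email's tokens, plus the linked page's tokens when the
    email is sparse, currently "unsure", and contains an http URL."""
    tokens = tokenize(body)
    url = find_http_url(body)
    sparse = len(set(tokens)) < MIN_DISTINCT_CLUES
    unsure = UNSURE_RANGE[0] <= score <= UNSURE_RANGE[1]
    if sparse and unsure and url:
        try:
            tokens += tokenize(fetch(url))  # assumption: append, not replace
        except OSError:
            pass  # network failure: fall back to the email's own tokens
    return tokens
```

The fetch argument is only there so the network call can be swapped 
out, e.g. tokens_for_classifier(body, 0.5, fetch=my_cached_fetch). The 
combined token list would then go back through the classifier for a 
second score.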

The extra servlet processing can take a couple of seconds, mostly 
network overhead, and is really only noticeable when paying close 
attention to message download times, but the results are well worth 
it! It nails them dead.

