[Spambayes] Latest spammer trick stymied

Richard Jowsey richard at jowsey.com
Tue Apr 1 08:52:35 EST 2003

>  >> We definitely should NOT crawl the site, just in case it really
>  >> is an innocent url.  The load can crush a site, particularly if
>  >> it's hosted.
>  Richard> Nah. You need to throw thousands of requests at a half-decent
>  Richard> web server before it gives up the ghost. And if they're
>  Richard> sending out 10 million mail pieces, they should expect their
>  Richard> http server to take some load. These are definitely NOT
>  Richard> innocent emails. They come from bogus senders, have minimal
>  Richard> headers (deliberately), and contain *nothing* but a url.
>  Richard> Which points, via redirect naturally, to an incest porn or
>  Richard> get-a-huge-penis site, etc.
> You can't make that judgement beforehand.  If the site you are poking
> is a valid site and the email received was not spam, none of what you
> said holds. If I remember correctly, you said this was only to be
> performed in circumstances where certain criteria were met, none of
> which included a conclusion the mail was spam.

Skip, I agree absolutely! We certainly can't assume that an email 
containing only a singleton url is spam. NB: when friends or 
colleagues send me a single url, as they do, such messages already 
get a "good" classification. No problem there.

But the same kind of message from a spammer ends up as "unsure", 
primarily because there simply aren't enough clues to be definite 
about its classification. Actually, what prompted this whole question 
was a "complaint" from one of my proxy beta testers about one of 
these spams. He reckoned it was "bloody obvious" that the message was 
junk. The classifier disagreed. I went looking for a simple solution!

My *only* criteria for poking the url (rather ironic choice of verb, 
considering the sites in question ;-) are:
   1. an "unsure" classification
   2. number of clues < 150 (or whatever max_discriminators one has)
   3. a URL in the message body
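
As a rough illustration, the three criteria above could be checked along these lines. This is only a sketch: the names find_urls and should_poke_url, the simple URL pattern, and the way the classification and clue count are passed in are all assumptions made for the example, not anything taken from the Spambayes code.

```python
import re

# Hypothetical helper names; nothing here is taken from the Spambayes source.
# Deliberately simple pattern: http(s) URLs up to the next whitespace or angle bracket.
URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)


def find_urls(body):
    """Return every http(s) URL found in the message body text."""
    return URL_RE.findall(body)


def should_poke_url(classification, num_clues, body, max_discriminators=150):
    """Return True only when all three criteria hold."""
    # 1. the classifier could not decide
    if classification != "unsure":
        return False
    # 2. the message produced fewer clues than the discriminator limit
    if num_clues >= max_discriminators:
        return False
    # 3. there is at least one URL in the body to look at
    return bool(find_urls(body))
```

A message that is already "good" or "spam", or one that is rich in clues, never triggers a fetch, so the extra request only happens for the thin, url-only messages in question.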

If the url happens to point at an "innocent" site, this extra bit of 
information-gathering will simply tip the message over to the "good" 
bucket. A Good Thing, no harm done. And, in this (unusual) case, 
there's definitely no extra load happening at the web server, 
precisely because there weren't millions of this email getting 
blasted out...
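
To keep this a single poke rather than a crawl, the fetch itself could be limited to one request with a timeout and a cap on how much of the page is read. The sketch below uses only the standard library; the function name fetch_once and the default limits are assumptions picked for illustration, and feeding the returned text back through the tokenizer and classifier is left out.

```python
import urllib.request


def fetch_once(url, max_bytes=65536, timeout=10):
    """Fetch one URL a single time and return at most max_bytes of it as text.

    No links on the page are followed and no retries are made, so each
    qualifying message costs the remote server one page request (plus any
    HTTP redirects that urllib follows on the way there).
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read(max_bytes)
    # Undecodable bytes are replaced rather than raising; the text is only
    # wanted as a source of extra clues.
    return raw.decode("utf-8", errors="replace")
```

The returned text would then be tokenized and the message rescored with the extra clues.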



