[Spambayes] Latest spammer trick stymied
richard at jowsey.com
Tue Apr 1 08:52:35 EST 2003
>     >> We definitely should NOT crawl the site, just in case it really
>     >> is an innocent url. The load can crush a site, particularly if
>     >> it's hosted.
>
>     Richard> Nah. You need to throw thousands of requests at a
>     Richard> half-decent web server before it gives up the ghost. And
>     Richard> if they're sending out 10 million mail pieces, they
>     Richard> should expect their http server to take some load. These
>     Richard> are definitely NOT innocent emails. They come from bogus
>     Richard> senders, have minimal headers (deliberately), and contain
>     Richard> *nothing* but a url. Which points, via redirect
>     Richard> naturally, to an incest porn or get-a-huge-penis site,
>
> You can't make that judgement beforehand. If the site you are poking
> is a valid site and the email received was not spam, none of what you
> said holds. If I remember correctly, you said this was only to be
> performed in circumstances where certain criteria were met, none of
> which included a conclusion the mail was spam.
Skip, I agree absolutely! We certainly can't assume that an email
containing only a singleton url is spam. NB: when friends or
colleagues send me a single url, as they do, such messages already
get a "good" classification. No problem there.
But the same kind of message from a spammer ends up as "unsure",
primarily because there simply aren't enough clues to be definite
about its classification. Actually, what prompted this whole question
was a "complaint" from one of my proxy beta testers about one of
these spams. He reckoned it was "bloody obvious" that the message was
junk. The classifier disagreed. I went looking for a simple solution!
My *only* criteria for poking the url (rather ironic choice of verb,
considering the sites in question ;-) are:
1. an "unsure" classification
2. number of clues < 150 (or whatever max_discriminators one has)
3. a URL in the message body
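
As a rough illustration only, the three criteria above could be checked
along these lines. This is a sketch, not the proxy's actual code: the
function name should_poke_url, the URL_RE pattern, and the argument
names are all made up for the example, and the default of 150 simply
mirrors the max_discriminators figure mentioned above.

```python
import re

# Hypothetical pattern: catches plain http/https URLs only. Real mail
# (HTML parts, obfuscated links) would need more careful extraction.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def should_poke_url(classification, num_clues, body, max_discriminators=150):
    """Return the first URL in the body if all three criteria hold, else None."""
    # 1. only "unsure" messages qualify; "good" and "spam" are left alone
    if classification != "unsure":
        return None
    # 2. the classifier must have run short of clues
    if num_clues >= max_discriminators:
        return None
    # 3. there must be a URL in the message body
    match = URL_RE.search(body)
    return match.group(0) if match else None
```

Anything already classified "good" or "spam", or any message with a full
set of clues, never triggers a fetch at all.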
If the url happens to point at an "innocent" site, this extra bit of
information-gathering will simply tip the message over to the "good"
bucket. A Good Thing, no harm done. And, in this (unusual) case,
there's definitely no extra load happening at the web server,
precisely because there weren't millions of this email getting