[Spambayes] Spam that bypasses spambayes
Tim Peters
tim.one at comcast.net
Fri Sep 12 00:08:18 EDT 2003
[Harri Pesonen]
> I had an idea a couple of weeks ago, that all url tokens should have
> more weight than other tokens. The spammer just wants you to click on
> some url, so the other text is not so important. They could even put
> random words there, and they have.
Code it and try it. spambayes *used* to have fancier URL tokenization than
it has now, and results got better by simplifying it -- there's no
substitute for testing ideas in a statistical system, and *everything* you
try will have both good effects and bad effects (there are no pure wins).
The best you can hope for is that the good outweigh the bad across a large
variety of test sets, and there's no way to determine that without testing
on a large variety of test sets.
> And maybe the url server address should be tokenized the same way as
> the address is tokenized in Received header. So the address below
> would yield
>
> url:biz
> url:gadgitz.biz
> url:www.gadgitz.biz
>
> Now it just does
>
> url:www
> url:gadgitz
> url:biz
I believe you're talking about
http://www.gadgitz.biz/promo.php?id=93778
That actually generates 7 url tokens today:
url:93778
url:biz
url:gadgitz
url:id
url:php
url:promo
url:www
and a
proto:http
token. Of these, the url:biz token has the highest spamprob in my database
today, and url:id isn't far behind it. Curiously, url:promo has only a
slightly spammy spamprob for me. There are no tokens in my database
containing the string gadgitz, so generating more tokens containing that
string wouldn't have helped. Given that spammers lose their domains only
slightly less frequently than they lose their email addresses, loading the
database with more spammer domain names du jour doesn't sound like a good
bet either.
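
For anyone who wants to test the idea, here is a minimal sketch of the two
schemes being compared. It is not the spambayes tokenizer source: the
regular expressions and the function names (flat_url_tokens,
suffix_host_tokens) are assumptions made up for illustration. The first
function reproduces the flat tokens listed above for the example URL; the
second builds the cumulative host suffixes Harri proposed.

```python
import re

# Hypothetical patterns for illustration only; not copied from spambayes.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
SEP_RE = re.compile(r"[/;?:@&=+,$.]")


def flat_url_tokens(text):
    """Emit one proto: token per URL plus a url: token per URL fragment."""
    tokens = []
    for proto, guts in URL_RE.findall(text):
        tokens.append("proto:" + proto.lower())
        for piece in SEP_RE.split(guts):
            if piece:  # skip empty strings left by adjacent separators
                tokens.append("url:" + piece.lower())
    return tokens


def suffix_host_tokens(host):
    """Emit cumulative suffixes of a host name, shortest first."""
    labels = host.lower().split(".")
    return ["url:" + ".".join(labels[i:])
            for i in range(len(labels) - 1, -1, -1)]


if __name__ == "__main__":
    print(flat_url_tokens("http://www.gadgitz.biz/promo.php?id=93778"))
    print(suffix_host_tokens("www.gadgitz.biz"))
```

Whether the suffix tokens help or hurt can only be settled by running both
variants over a variety of test sets, as noted above.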
> Maybe it should do both and have more weight that way. Also decode %
> encoding and find server names for ip addresses... :-)
I hesitate to put in anything by default that requires going off the local
machine (whether to suck down a web page or just to do a DNS lookup). That
may be OK in an industrial-strength setting with industrial-strength
connectivity, but lots of users are stuck on slow dialup lines to sluggish
ISPs. Controlling such stuff by options, disabled by default, would be
fine.
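
A rough sketch of what "controlled by an option, disabled by default" could
look like. Everything here is an assumption for illustration: the function
name url_host_tokens and the resolve_names flag are hypothetical and are
not existing spambayes options. Decoding % escapes is purely local, so it
always runs; the reverse DNS lookup runs only when the caller asks for it.

```python
import ipaddress
import socket
from urllib.parse import unquote, urlsplit


def url_host_tokens(url, resolve_names=False, resolver=socket.gethostbyaddr):
    """Return url: tokens for the host part of a URL.

    resolve_names is a hypothetical option, off by default, so that
    nothing leaves the local machine unless the user opts in.
    """
    # Decode % escapes in the host; this needs no network access.
    host = unquote(urlsplit(url).hostname or "")
    tokens = ["url:" + host] if host else []
    if not (resolve_names and host):
        return tokens
    try:
        ipaddress.ip_address(host)  # only literal IP addresses get a lookup
    except ValueError:
        return tokens
    try:
        name = resolver(host)[0]  # gethostbyaddr returns (name, aliases, addrs)
    except OSError:
        return tokens  # lookup failed; keep only the local tokens
    tokens.append("url:" + name.lower())
    return tokens
```

With the default resolve_names=False the function never touches the
network, which matters for users on slow dialup lines. The resolver
argument exists only so the lookup can be swapped out when testing.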