[spambayes-dev] SpamBayes rules. New use. Suggestion for
improvement.
Skip Montanaro
skip at pobox.com
Mon Jul 28 07:54:34 EDT 2003
Fionn> Hello to all readers. This is my first post to the list so please
Fionn> bear with me if I should bring up anything that has been there
Fionn> already. The MailMan Archives are missing any search
Fionn> functionality, for that matter.
Known problem. Your best bet is to use Google. For example, searching for
    site:mail.python.org pop3proxy -checkins
will display messages about pop3proxy but exclude the checkin messages.
I updated question 1.2 of the FAQ to show how to do this.
Fionn> As far as I can see there is e.g. no token that indicates the
Fionn> length of a message. It might even be advisable to specify the
Fionn> length not in words but in tokens. I just looked over last days
Fionn> logs and would estimate that about 50% of all spam is less than
Fionn> 75 tokens, about 90% is less than 250 tokens and hardly any spam
Fionn> at all gets together 1000 or more tokens. So, a special token,
Fionn> bearing the length of a mail in a form like t-length:
Fionn> [<500|>500|>1000|>2000|>5000] might be a useful indicator against
Fionn> spams for mail like the one mentioned above which was a pretty
Fionn> long one.
See the FAQ, question 5.1. For its designed use, SpamBayes is so good that
very few of the extensions which have been tried in the past several months
have yielded any improvements. It's possible that in your multi-user
environment adding a token which expresses the size of the message will
help. It should be easy to try out: just increment a counter before every
yield statement and yield the result at the end. You might want to try a
few variations:
    yield counter
    yield counter // <some bucket size>
    yield log2(counter)  # log2 is a function local to tokenizer.py
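
As a rough sketch of the variations above: rather than touching every yield
statement, one could wrap the token stream in a generator that counts as it
passes tokens through. The function names and the default bucket size here
are made up for illustration and are not part of tokenizer.py; the
"t-length:" prefix is borrowed from Fionn's suggestion, and math.log2 from
the standard library stands in for the tokenizer's local log2.

```python
import math


def with_length_token(tokens, bucket_size=250):
    """Pass tokens through unchanged, then yield one length token.

    Hypothetical helper: bucket_size is an arbitrary choice to tune.
    """
    counter = 0
    for token in tokens:
        counter += 1
        yield token
    # Integer-divide so messages of similar length share one token.
    yield "t-length:%d" % (counter // bucket_size)


def with_log_length_token(tokens):
    """Same idea, but bucket by powers of two instead of a fixed size."""
    counter = 0
    for token in tokens:
        counter += 1
        yield token
    # Guard against log2(0) for a message that produced no tokens.
    bucket = int(math.log2(counter)) if counter else 0
    yield "t-length:2**%d" % bucket


if __name__ == "__main__":
    sample = ["free", "offer", "click", "here", "now"]
    print(list(with_length_token(sample, bucket_size=2)))
    print(list(with_log_length_token(sample)))
```

Bucketing matters because a raw count would produce a distinct token for
nearly every message, which gives the classifier little to learn from.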
Skip