[spambayes-dev] SpamBayes rules. New use. Suggestion for improvement.

Mon Jul 28 07:54:34 EDT 2003

    Fionn> Hello to all readers. This is my first post to the list so please
    Fionn> bear with me if I should bring up anything that has been there
    Fionn> already. The MailMan Archives are missing any search
    Fionn> functionality, for that matter.

Known problem.  Your best bet is to use Google.  For example, searching for

    site:mail.python.org pop3proxy -checkins

will display messages about pop3proxy but exclude the checkin messages.
I updated question 1.2 of the FAQ to show how to do this.

    Fionn> As far as I can see there is e.g. no token that indicates the
    Fionn> length of a message. It might even be advisable to specify the
    Fionn> length not in words but in tokens. I just looked over the last
    Fionn> day's logs and would estimate that about 50% of all spam is less
    Fionn> than 75 tokens, about 90% is less than 250 tokens and hardly any
    Fionn> spam at all reaches 1000 or more tokens. So, a special token,
    Fionn> bearing the length of a mail in a form like t-length:
    Fionn> [<500|>500|>1000|>2000|>5000] might be a useful indicator against
    Fionn> spams for mail like the one mentioned above which was a pretty
    Fionn> long one.

See the FAQ, question 5.1.  For its designed use Spambayes is so good that
very few extensions which have been tried in the past several months have
yielded any improvements.  It's possible that in your multi-user environment
adding a token which expresses the size of the message will help.  It should
be easy to try out: just increment a counter before every yield statement
and yield that at the end.  You might want to try a few variations:

    yield counter
    yield counter // <some bucket size>
    yield log2(counter)    # log2 is a function local to tokenizer.py
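
To make the counting idea concrete, here is a rough standalone sketch rather
than a patch against tokenizer.py.  The wrapper name with_length_token, its
parameters, and the local log2 stand-in are all made up for illustration; the
"t-length:" prefix is borrowed from your suggestion.  Wrapping an existing
token generator avoids touching every yield statement, at the cost of being
less faithful to how the real tokenizer is laid out.

```python
import math


def log2(n):
    # Stand-in for the helper of the same name in tokenizer.py (assumed
    # here to be a plain base-2 logarithm).
    return math.log(n) / math.log(2)


def with_length_token(tokens, bucket_size=None, use_log=False):
    """Pass tokens through unchanged, then yield one extra length token.

    Hypothetical helper: counts tokens as they go by and emits a single
    synthetic token describing the count at the end.
    """
    counter = 0
    for tok in tokens:
        counter += 1
        yield tok

    if use_log:
        # max() guards against log2(0) on an empty message.
        value = int(log2(max(counter, 1)))
    elif bucket_size:
        # Integer division puts messages into fixed-width buckets.
        value = counter // bucket_size
    else:
        value = counter
    yield "t-length:%d" % value


if __name__ == "__main__":
    sample = ["free", "money", "now"]
    print(list(with_length_token(sample)))
    print(list(with_length_token(sample, bucket_size=250))[-1])
    print(list(with_length_token(sample, use_log=True))[-1])
```

The raw count will probably generate too many distinct tokens to be useful,
which is why the bucketed and log variants are worth comparing against it.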

Skip