[Spambayes] Re: Move closer to Gary's ideal

Skip Montanaro skip@pobox.com
Sat, 21 Sep 2002 09:48:15 -0500


    Guido> I guess this would require an X-Spam-Probability: <float> header,
    Guido> rather than a Yes/No header, either in addition or instead of the
    Guido> Yes/No header.  If you also want to provide a Yes/No header,
    Guido> you'd need a database of user preferences plus a UI for it.  But
    Guido> only a numeric value is probably hard for the average mail
    Guido> program to filter on...

Yes, which is why SpamAssassin reports its findings in several different ways.
Just to review for people who aren't familiar with it, here are the SA
headers added to a recent spam I received.  (The last two headers are not
score-related.  The X-Spam-Prev-Content-Type: header is used to defang
messages with complex MIME types.  They are delivered as text/plain.)

    X-Spam-Status: Yes, hits=16.4 required=5.0
            tests=FROM_NAME_NO_SPACES,NO_REAL_NAME,FRONTPAGE,BIG_FONT,
                  MIME_EXCESSIVE_QP,CHARSET_FARAWAY_HEADERS,NO_MX_FOR_FROM,
                  MSG_ID_ADDED_BY_MTA_3,MISSING_HEADERS
            version=2.31
    X-Spam-Flag: YES
    X-Spam-Level: ****************
    X-Spam-Checker-Version: SpamAssassin 2.31 (devel $Id: SpamAssassin.pm,v 1.94.2.2 2002/06/20 17:20:29 hughescr Exp $)
    X-Spam-Prev-Content-Type: multipart/related; boundary="a7e59999-cd9f-11d6-8d6b-00e04c6236e7"

Each "*" counts for 1, so since the computed value was 16.4, the
X-Spam-Level: header contains 16 stars.  That header, the X-Spam-Flag:
header, and the beginning of the X-Spam-Status: header are all much easier
to match on in filters than numeric values.  The stars also let users define
their own "middle ground" if they want, e.g. in procmail, something like this:

    :0
    * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
    definitelyspam

    :0
    * ^X-Spam-Level: \*\*\*\*\*
    probablyspam

    :0
    * ^X-Spam-Level: \*\*\*
    probablynotspam

    :0
    * ^X-Spam-Level:
    goodstuff

You should be able to do the same in Outhouse or other mailers.
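
For illustration, here is a minimal Python sketch of how a filter might turn
one numeric score into the several header forms shown above.  The function
name spam_headers and its defaults are hypothetical; only the header names
and the 5.0 threshold come from the SpamAssassin output quoted earlier, and
this is not SpamAssassin's or hammie's actual code.

```python
def spam_headers(score, required=5.0):
    """Return (name, value) header pairs for a numeric spam score.

    Hypothetical helper: the header names mirror the SpamAssassin output
    quoted in this message, but the logic is only an illustration.
    """
    is_spam = score >= required
    # One "*" per whole point, so 16.4 gives 16 stars; negatives give none.
    stars = "*" * max(0, int(score))
    headers = [
        ("X-Spam-Status", "%s, hits=%.1f required=%.1f"
         % ("Yes" if is_spam else "No", score, required)),
        ("X-Spam-Level", stars),
    ]
    # Assumption: the Yes/No flag header is only added to flagged messages.
    if is_spam:
        headers.append(("X-Spam-Flag", "YES"))
    return headers


if __name__ == "__main__":
    for name, value in spam_headers(16.4):
        print("%s: %s" % (name, value))
```

Because the star header is just a repeated character, a prefix match such as
the procmail recipes above works as a ">= N" test without any arithmetic in
the mail client.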

What this all boils down to is that, assuming spambayes/hammie gets beyond
the "extremely cool experiment in algorithm design" stage, multiple ways to
evaluate its results need to be provided to make it useful to as wide a
range of users as possible.

Which leads to another experiment.  You could scale its probabilities to
match the default scale used by SpamAssassin (>= 5.0 is considered spam) and
compare results from the two.
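
One way the scaling might be done, sketched in Python.  The linear mapping,
the function names, and the 0.5 default cutoff are all assumptions made for
illustration; nothing above says how the probabilities should be mapped, only
that 5.0 is SpamAssassin's default threshold.

```python
SA_THRESHOLD = 5.0  # SpamAssassin's default "required" score


def scale_to_sa(prob, cutoff=0.5):
    """Map a spam probability in [0, 1] onto a SpamAssassin-like scale.

    Assumed mapping: linear, chosen so that prob == cutoff lands exactly
    on SA_THRESHOLD.  The cutoff value itself is a hypothetical parameter.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    if not 0.0 < cutoff <= 1.0:
        raise ValueError("cutoff must be in (0, 1]")
    return prob / cutoff * SA_THRESHOLD


def is_spam(prob, cutoff=0.5):
    """Apply the SpamAssassin-style '>= 5.0 is spam' rule to a probability."""
    return scale_to_sa(prob, cutoff) >= SA_THRESHOLD


if __name__ == "__main__":
    for p in (0.02, 0.5, 0.99):
        print("%.2f -> %.1f" % (p, scale_to_sa(p)))
```

With the scores on a common scale, the same messages can be run through both
filters and the disagreements inspected by hand.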

Skip