[Spambayes] training problem?

Skip Montanaro skip at pobox.com
Wed Dec 3 16:32:25 EST 2003


    Seth> Nice Wiki work.  Another difference I see between your approach
    Seth> and my previous one is that you trained on 30 days worth of spam.
    Seth> I was afraid to do that since I get 140 spam/day, so 30 days worth
    Seth> is 4,200 messages.  To get that much ham, I would need to go back
    Seth> almost 6 months.  However, maybe that long of a history for spam
    Seth> is what it takes to get good detection.  I'm amazed at your low
    Seth> unsure rate.

I really doubt that anyone needs to train on every single spam message which
comes through in a 30-day period.  Most spam probably comes from a small
handful of cretins, and spam from the same cretin seems to arrive in bunches
(gotta make full use of a new account before it gets shut off).
Consequently, training on a single spammy unsure message is often sufficient
to nudge several messages out of the unsure region and into spam territory.

I've appended a small script I use to help decide which of the spams and
hams that turn up "unsure" I should train on next.  I run a mailbox through
sb_filter.py like so:

    sb_filter.py ~/Mail/unsure | python ~/tmp/scan.py

The scan.py script spits out the subject, message-id, date and
classification headers sorted by score.  By default, it only considers
messages classified as "unsure".  You can force it to consider any/all
combinations though, e.g.:

    sb_filter.py ~/Mail/unsure | python ~/tmp/scan.py 'ham|spam|unsure'
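For anyone curious how the filter argument and the sort fit together, here
is a minimal sketch.  It assumes the classification header looks like
"X-Spambayes-Classification: unsure; 0.45" (label, semicolon, score), which
is what the script's own parsing implies.  The helper name select_by_label
and the sample headers are made up for illustration.

```python
import re

def select_by_label(headers, pattern="unsure"):
    """Return (score, header) pairs whose value matches pattern, lowest score first."""
    picked = []
    for header in headers:
        # Match against the header value only, not the header name.
        value = header.split(":", 1)[1]
        if re.search(pattern, value):
            # Assumption: the score is the last whitespace-separated field.
            score = float(value.split()[-1])
            picked.append((score, header))
    picked.sort()
    return picked

# Hypothetical sample data, not taken from a real mailbox.
sample = [
    "X-Spambayes-Classification: unsure; 0.81",
    "X-Spambayes-Classification: ham; 0.02",
    "X-Spambayes-Classification: unsure; 0.23",
    "X-Spambayes-Classification: spam; 0.99",
]

for score, header in select_by_label(sample):
    print(score, header)
```

Passing 'ham|spam|unsure' as the pattern selects all four sample headers,
since the argument is treated as a regular expression.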

The idea is that you train on one or a few of your lowest scoring spams
and/or highest scoring hams, save your unsure file, then run the above
again.  Any previously "unsure" spams which now show up at the spam end of
things get ignored.  Lather, rinse, repeat.  When you're tired of the
cleansing cycle (or your hair is squeaky clean), rename your unsure folder,
e.g.:

    mv ~/Mail/unsure ~/Mail/unsure.save

then train on it again, e.g.:

    formail -s procmail < ~/Mail/unsure.save

The above commands are what I use in my Unix-y/procmail-ish/sb_filter-laden
environment.  You will obviously have to adjust them according to the needs
of your environment, but the basic idea is the same everywhere.  I think
this process is even easier in the Outlook plugin.  Sort your unsure folder
by score, move a small number of the most out-of-whack messages where they
belong, then reclassify your unsure folder.
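
Picking the "most out-of-whack" messages can be sketched roughly as below.
This is only an illustration: it assumes you have already looked at each
unsure message and decided for yourself whether it is ham or spam.  The
function name training_candidates, the default of one message per end, and
the sample data are all invented.

```python
def training_candidates(messages, n=1):
    """Pick the n lowest-scoring spams and the n highest-scoring hams.

    messages: (score, kind, subject) tuples, where kind is your own
    judgement ("ham" or "spam") of a message the filter called unsure.
    """
    spams = sorted(m for m in messages if m[1] == "spam")
    hams = sorted((m for m in messages if m[1] == "ham"), reverse=True)
    return spams[:n], hams[:n]

# Hypothetical contents of an unsure folder.
unsure = [
    (0.25, "spam", "Cheap meds"),
    (0.60, "spam", "Refinance now"),
    (0.85, "ham", "Lunch Friday?"),
    (0.40, "ham", "Patch review"),
]

low_spam, high_ham = training_candidates(unsure)
print(low_spam)
print(high_ham)
```

Train on those few, reclassify the rest, and repeat with whatever is still
unsure.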

Skip

    #!/usr/bin/env python
    """Print subject, date, message-id and classification of messages in an
    mbox read from stdin, sorted by SpamBayes score (lowest first)."""

    import re
    import sys

    NO_SUBJECT = "<no subject>"

    # Classifications to report.  Arguments are joined into a regex
    # alternation, so 'ham|spam' and separate 'ham' 'spam' both work.
    scanfor = "|".join(sys.argv[1:]) or "unsure"

    msgid = date = ""
    sub = NO_SUBJECT
    info = []
    for line in sys.stdin:
        lowered = line.lower()
        if line.startswith("From "):
            # mbox separator: a new message starts, forget the old headers
            msgid = date = ""
            sub = NO_SUBJECT
        elif lowered.startswith("subject: "):
            sub = line.strip()
        elif lowered.startswith("message-id: "):
            msgid = line.strip()
        elif lowered.startswith("date: "):
            date = line.strip()
        elif lowered.startswith("x-spambayes-classification: "):
            cls = line.strip()
            if re.search(scanfor, cls) is not None:
                # the score is the last whitespace-separated field
                prob = float(cls.split()[-1])
                info.append((prob, (sub, date, msgid, cls)))
            msgid = date = ""
            sub = NO_SUBJECT

    # Lowest scores first
    info.sort()
    for prob, (sub, date, msgid, cls) in info:
        print()
        print(sub)
        print(date)
        print(msgid)
        print(cls)



