[spambayes-dev] contrib/findbest.py

Skip Montanaro skip at pobox.com
Wed Jan 21 16:44:47 EST 2004


I just checked findbest.py into the contrib directory.  Here's the
docstring.

    Find the next "best" unsure message to train on.

        %(prog)s [ -h ] [ -s ] [ -b N ] ham spam unsure

    Given a number of unsure messages and a desire to keep your training
    database small, the question naturally arises, "Which message should I
    add to my database next?".  A common approach is to sort the unsures by
    their SpamBayes scores and train on the one which scores lowest.  This
    is a reasonable approach, but there is no guarantee the lowest scoring
    unsure is in any way related to the other unsure messages.

    This script offers a different approach.  Given an existing pile of ham
    and spam, it trains on them to establish a baseline, then for each
    message in the unsure pile, it trains on that message, scores the entire
    unsure pile against the resulting training database, then untrains on
    that message.  For each such message the following output is generated:

        * spamprob of the candidate message

        * number of other unsure messages which would score as spam if it
          were added to the training database

        * overall mean of all scored messages after training

        * standard deviation of all scored messages after training

        * message-id of the candidate message
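
The train/score/untrain loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in findbest.py: ToyClassifier is a hypothetical stand-in for a real classifier, evaluate_candidates and the 0.9 spam cutoff are made-up names and values, and it assumes each candidate is trained as spam, that its spamprob is taken before training, and that the standard deviation is the population one.

```python
from statistics import mean, pstdev


class ToyClassifier:
    """Hypothetical stand-in for a real classifier: per-class token counts."""

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def learn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] = counts.get(tok, 0) + 1

    def unlearn(self, tokens, is_spam):
        # Exact inverse of learn(), so the database returns to its baseline.
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] -= 1
            if counts[tok] == 0:
                del counts[tok]

    def spamprob(self, tokens):
        # Crude score: average of smoothed per-token spam ratios.
        probs = []
        for tok in set(tokens):
            s = self.spam.get(tok, 0)
            h = self.ham.get(tok, 0)
            probs.append((s + 0.5) / (s + h + 1.0))
        return mean(probs) if probs else 0.5


def evaluate_candidates(clf, unsure, spam_cutoff=0.9):
    """unsure maps message-id -> token list; returns one tuple per candidate."""
    results = []
    for msgid, tokens in unsure.items():
        candidate_prob = clf.spamprob(tokens)   # assumption: scored before training
        clf.learn(tokens, True)                 # assumption: trained as spam
        scores = {m: clf.spamprob(t) for m, t in unsure.items()}
        clf.unlearn(tokens, True)               # restore the baseline
        others_as_spam = sum(
            1 for m, s in scores.items() if m != msgid and s >= spam_cutoff
        )
        results.append(
            (candidate_prob, others_as_spam,
             mean(scores.values()), pstdev(scores.values()), msgid)
        )
    return results
```

Each candidate triggers a rescoring of the whole unsure pile, which is where the quadratic cost comes from.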

    With no options, every unsure message is tried as a candidate.  At the
    end of the run a file, "best.pck", is written out containing a
    dictionary keyed by the overall mean rounded to three decimal places.
    The values are lists of the message-ids which generate that mean.
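
As an illustration of the "best.pck" layout just described, here is a small sketch of building and pickling such a dictionary. The function names group_by_mean and save_best are hypothetical, and the (mean, message-id) input shape is an assumption made for the example.

```python
import pickle
from collections import defaultdict


def group_by_mean(results):
    """Group message-ids by overall mean rounded to three decimal places.

    results is an iterable of (mean, message_id) pairs (assumed shape).
    """
    best = defaultdict(list)
    for mean_score, msgid in results:
        best[round(mean_score, 3)].append(msgid)
    return dict(best)


def save_best(best, path="best.pck"):
    # Write the dictionary out as a pickle file.
    with open(path, "wb") as f:
        pickle.dump(best, f)
```

Rounding makes candidates with nearly identical means share one key, which is why the values are lists rather than single message-ids.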

    Three options affect the behavior of the program.  If the -h flag is
    given, this help message is displayed and the program exits.  If the -s
    flag is given, no messages which score as spam are tested as candidates.
    If the -b N flag is given, only the messages which generated the N
    highest means in the last run without the -b flag are tested as
    candidates.  Because the program can be very slow (O(n^2) in the
    number of unsure messages), if you have a fairly large pile of unsure
    messages, these options can speed things up dramatically.  If the -b
    flag is used, a new "best.pck" file is not written.  Typically you would
    run once without the -b flag, then several times with the -b flag,
    adding one message to the spam pile after each run.  After adding
    several messages to your spam pile, you might then redistribute the
    unsure pile to move spams and hams to their respective folders, then
    start again with a smaller unsure pile.
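
The -b N selection could look something like the sketch below, assuming the dictionary layout described for "best.pck" (rounded mean mapped to a list of message-ids). The name candidates_from_best is hypothetical.

```python
def candidates_from_best(best, n):
    """Return the message-ids behind the n highest means in best."""
    top_means = sorted(best, reverse=True)[:n]
    return [msgid for m in top_means for msgid in best[m]]
```

Because several messages can share a rounded mean, this may return more than n message-ids.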

    The ham, spam and unsure command line arguments can be anything suitable
    for feeding to spambayes.mboxutils.getmbox().  The "best.pck" file is
    searched for in, and written to, these locations, in this order:

        * best.pck in the current directory

        * $HOME/tmp/best.pck

        * $HOME/best.pck
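
A lookup over those three locations might be sketched like this. The function names are hypothetical, the home directory is taken from os.path.expanduser("~") as an assumption, and only the read side is shown, since how the write location is chosen is not spelled out above.

```python
import os


def best_pck_candidates(home=None):
    """List the three locations in the documented order."""
    if home is None:
        home = os.path.expanduser("~")
    return [
        "best.pck",                             # current directory
        os.path.join(home, "tmp", "best.pck"),  # $HOME/tmp/best.pck
        os.path.join(home, "best.pck"),         # $HOME/best.pck
    ]


def find_best_pck(home=None):
    """Return the first existing candidate path, or None if none exists."""
    for path in best_pck_candidates(home):
        if os.path.exists(path):
            return path
    return None
```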

    [To do?  Someone might consider the reverse operation.  Given a pile of
    ham and spam, which message can be removed with the least impact?  What
    pile of mail should that removal be tested against?]

I'm sure there are mistakes in there.  Feel free to rap my virtual knuckles
or fix them...

Skip


