[spambayes-dev] lowest scoring message isn't always "best" one to train on

Mon Jan 19 10:44:13 EST 2004

Based on a suggestion by Eli Stevens, over the weekend I decided to try
burning some electrons to decide which message to train on next given a pile
of unsures and false negatives (I haven't got any false positives lying
around).  The script I came up with takes three inputs: a pile of hams, a
pile of spams, and a pile of unsures/fn's.  It trains on the hams and spams,
then for each message in the unsure/fn pile does this:

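    for msg in getmbox(unsures):
        # Tentatively train on this one candidate.
        h.train(msg)
        newspams = 0
        # Count how many unsures now score above the spam cutoff.
        for trial in getmbox(unsures):
            prob = cls.spamprob(trial)
            if prob > spam_cutoff:
                newspams += 1
        # Undo the trial training before trying the next candidate.
        h.untrain(msg)
        print msg['message-id'], newspams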

As you can see, since it's O(n*n) in the number of unsures, it's not a
script to be run casually with a large number of unsures (alas, that's how
I've been running it).  I have a little more code in there to avoid scoring
messages which already score as spam and to limit a scoring run to the best
candidates from a previous run, but it can still take a while to run.
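
For anyone who wants to play with the idea without a real mailbox, here is a minimal, self-contained sketch of the same train/score/untrain loop. Everything in it is an assumption made for illustration: `ToyClassifier`, `rank_candidates`, the token lists, and the averaging score are hypothetical stand-ins and are not the SpamBayes classifier or its API.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def spamprob(self, tokens):
        # Crude score: average per-token spam ratio, 0.5 for unseen tokens.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append(s / (s + h) if s + h else 0.5)
        return sum(probs) / len(probs) if probs else 0.5


def rank_candidates(clf, unsures, spam_cutoff=0.9):
    """For each unsure, train on it as spam, count the unsures that then
    score above spam_cutoff, and untrain again.  O(n*n) in len(unsures)."""
    results = []
    for msg_id, tokens in unsures.items():
        score = clf.spamprob(tokens)      # score before the trial training
        clf.train(tokens, True)
        newspams = sum(1 for other in unsures.values()
                       if clf.spamprob(other) > spam_cutoff)
        clf.untrain(tokens, True)         # restore the original state
        results.append((score, msg_id, newspams))
    return results


clf = ToyClassifier()
clf.train(["meeting", "lunch"], False)    # made-up ham
clf.train(["viagra", "cheap"], True)      # made-up spam

unsures = {                               # made-up unsure pile
    "a": ["pills", "offer", "cheap"],
    "b": ["pills", "lunch"],
    "c": ["refinance", "lunch"],
    "d": ["pills", "cheap"],
    "e": ["offer", "cheap"],
}

results = rank_candidates(clf, unsures)
for score, msg_id, newspams in results:
    print("%.3f %s %d" % (score, msg_id, newspams))
```

Sorting `results` on the last field gives the "most bang for your buck" ordering. The toy data says nothing about real mail; it only shows the mechanics of the loop.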

It pointed out something interesting, however: if you want the most bang for
your buck (push the most messages into the spam region), the best message to
train on often seems to be a message with a fairly high score.  Here's a
snippet of output from the start of my latest run:

    0.032 <mslcyc760780 at rocketmail.com> 4
    0.321 <27412761818.707072765454907 at python.org> 3
    0.539 <4005795A000C4777 at occmta11a.terra.com.mx> (added by
       postmaster at emailcluster.terra.com.mx) 5
    0.872 <200401180345.i0I3jvCd013462 at manatee.mojam.com> 3
    0.682 <s930$209a$6o8t at a4maf1.bz> 6
    0.869 <5-3w2y$o80o688x-5z0wp58v9h4hi at pm5mn> 6
    0.846 <3$1fkv63$0-4u-sn04 at otdgq.l1.43isz> 6
    0.880 <192k46ax$r$lt at 6bslldmd> 12
    0.891 <q$p$hg95$-$$590x7f$67$1h9--g$d3 at zqh.9dz0v> 15
    0.875 <ord68z$-si366--cm1$4b$l at woa3f2.1m4e22> 11
    0.798 <2$-44$$2$mymd27 at ulkkm64> 12
    0.195 <20040118052804.NDHS11926.out009.verizon.net at terrapin> 10

Note that the first item has a very low spamprob itself, but of the bunch I
displayed, the best ones to train on to push the most other spams into spam
range all score around 0.8 to 0.9.  (I currently have my cutoffs set at 0.1
and 0.9.)  I suspect this is because I'm selecting messages based on their
similarity to lots of other messages in the unsure pile (many of which may
already have fairly high unsure scores), so the score of the newly trained
message is somewhat less important than its similarity to other unsures.

Skip

P.S.  As an aside, note the message-id for the third message above has
"(added by postmaster at ...)".  I have seen annotations like that a few times.
Is it still a valid message-id (from an RFC 2822 standpoint)?  It seems like
the annotation would be a fairly objective feature to extract from messages.
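
A rough sketch of how such an annotation might be pulled out of a raw Message-ID header value follows. The function name `split_message_id` and the regular expression are hypothetical and deliberately loose (this is not a full RFC 2822 parser), and the sample value assumes that the " at " shown in the archive stands for "@" in the original header.

```python
import re

# "<id>" optionally followed by one trailing "(comment)".
MSGID_RE = re.compile(r'^\s*<([^<>\s]+)>\s*(?:\((.*)\))?\s*$', re.DOTALL)


def split_message_id(raw):
    """Return (msg_id, comment); both are None if the value doesn't parse,
    and comment is None when there is no trailing annotation."""
    # Unfold the header: a line break plus indentation becomes one space.
    unfolded = re.sub(r'\s*\r?\n\s+', ' ', raw)
    m = MSGID_RE.match(unfolded)
    if not m:
        return None, None
    return m.group(1), m.group(2)


annotated = ("<4005795A000C4777@occmta11a.terra.com.mx> (added by\n"
             "   postmaster@emailcluster.terra.com.mx)")
plain = "<s930$209a$6o8t@a4maf1.bz>"

print(split_message_id(annotated))
print(split_message_id(plain))
```

The presence or absence of the comment could then be emitted as a single token for the classifier to weigh.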

S