[Spambayes] How low can you go?

Wed Dec 10 11:50:47 EST 2003

    >> At the moment I have trained on 14 spams and 20 hams and am quite
    >> pleased with how it's performing so far.  I've received mail for a
    >> half dozen or so different mailing lists, and it's catching spams
    >> left and right.  I anticipate a slew of unsures overnight as I get
    >> new kinds of email (both ham and spam), but I will be damned
    >> selective about what I add to my database.

    Seth> OK, I'll bite.  How did you select those 14 spams and 20 hams?
    Seth> Just please don't say they're random.  Even if you have to lie.

Nothing magic or random.  I primed the pump with one ham and one spam.  Then
I sorted the unsures which arrived by score.  Train the lowest-scoring spam
as spam.  Now rescore the unsure mailbox and delete any messages which now
score as spam.  Lather.  Rinse.  Repeat.  You will obviously have many hams
which initially score as unsure as well.  Do the same thing for them, just
start from the highest-scoring ham.
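
For anyone who wants to play with the idea, here is a rough sketch of that
loop.  It is only an illustration: ToyClassifier is a made-up stand-in that
averages per-word spam ratios, not the SpamBayes classifier or its API, and
train_on_extremes, its arguments, and the strict cutoff comparisons are all
assumptions.  The is_spam flag stands for the human's verdict on each unsure.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in scorer (NOT the SpamBayes API)."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    def train(self, text, is_spam):
        target = self.spam_words if is_spam else self.ham_words
        target.update(set(text.lower().split()))

    def score(self, text):
        # Average the spam ratio of every word seen in training;
        # a message made only of unknown words scores 0.5.
        probs = []
        for word in sorted(set(text.lower().split())):
            s, h = self.spam_words[word], self.ham_words[word]
            if s + h:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def resolved(clf, msg, ham_cutoff, spam_cutoff):
    """True if the classifier now agrees with the human verdict."""
    text, is_spam = msg
    score = clf.score(text)
    return score > spam_cutoff if is_spam else score < ham_cutoff


def train_on_extremes(clf, unsures, ham_cutoff=0.15, spam_cutoff=0.60):
    """Train on the lowest-scoring spam, then the highest-scoring ham,
    dropping messages that score correctly after each training step.
    Returns the number of messages trained on."""
    pending = list(unsures)
    trained = 0
    while pending:
        for want_spam in (True, False):
            group = [m for m in pending if m[1] == want_spam]
            if not group:
                continue
            pick = (min if want_spam else max)(
                group, key=lambda m: clf.score(m[0]))
            clf.train(*pick)
            trained += 1
            pending.remove(pick)
            # Rescore; keep only messages that are still misjudged.
            pending = [m for m in pending
                       if not resolved(clf, m, ham_cutoff, spam_cutoff)]
    return trained
```

The point of picking the extremes is that each trained message is the one
the classifier was most wrong about, so it tends to drag a batch of similar
unsures over the cutoff with it and the database stays small.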

I awoke to 96 unsures this morning.  I did the above dance for a while.  I'm
now up to 43 spams and 35 hams.  I still have a few messages in my unsure
mailbox which score between 0.30 and 0.58, but with such a small database I
don't want to overload the spam side of things.  I'll wait until I get a few
more hams.  Note that to keep the database more-or-less in balance, I do
train on the occasional ham, though I try to find ones that score at the
higher end of the ham region.

    Seth> Perhaps you selected them by incrementally training on a corpus of
    Seth> 100 each?

No starting corpus other than mail as it arrived and the two initial pump
primers.  They were recently received messages as well though.  I just
wanted something to keep the initial scores from all being 0.50.

    Seth> What are your current thresholds?  

0.15 and 0.60.  I moved the spam threshold from 0.65 this morning.
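
To make the effect of the two cutoffs concrete, here is a tiny sketch of the
three-way split they imply.  The function name and the handling of scores
that land exactly on a cutoff are assumptions for illustration, not taken
from SpamBayes itself.

```python
def classify(score, ham_cutoff=0.15, spam_cutoff=0.60):
    """Map a score in [0, 1] to 'ham', 'unsure', or 'spam'.

    Boundary handling (strict comparisons) is an assumption.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

With these values the messages scoring between 0.30 and 0.58 stay unsure,
and a message scoring 0.62 is spam now but would have been unsure under the
old 0.65 threshold.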

    Seth> I would expect a lot of unsures, which doesn't bother me a bit,
    Seth> but what are you seeing (so far) for false positives and false
    Seth> negatives?

A few.  I haven't seen any false positives so far.  Perhaps five false
negatives.  I think the system does a good job vis-a-vis false positives
because most people's ham tends to be topically very similar.  On the other
hand, spam is all over the map, both as far as its content is concerned, as
well as the mechanisms of the delivery process (hiding delivery routes in
various ways, obscuring content, etc.), so it's understandable that spam is
harder to classify.  I think it also helps to explain why my ham/spam
thresholds can be so asymmetric and still be effective.

Note: When I encounter a false negative I don't automatically train on it.
Instead, I move it to my unsure mailbox.  Since the time it arrived and was
incorrectly scored as ham, I may have done enough training on unsures to now
classify it correctly as spam, in which case training on it won't help much.  To be
most accurate, I should look for false negatives and false positives before
considering my unsure mailbox (since they are the most egregious mistakes),
but that means I have to skim 20 mailboxes looking for mistakes.  I'm more
than happy to just deal with false negatives when I encounter them during my
regular mail reading.

    Seth> Damned impressive, if you ask me.

I think so too.  (Not my training technique, SpamBayes.)

Skip