[Spambayes] How low can you go?
skip at pobox.com
Thu Dec 11 17:13:03 EST 2003
(Seth, are you really nobody at spamcop.net or is that address harvester
fodder? I've been deleting it from my replies because it smells vaguely
like harvester fodder.)
>> Nothing magic or random. I primed the pump with one ham and one spam.
>> Then sorted the unsures which arrived by score. Train the lowest
>> scoring spam as spam. ...
Seth> I just re-read this and realized I missed something key in your
Seth> description. Your training set is culled only from unsures,
Seth> rather than the set of all messages.
Well, my description wasn't perfect either. False negatives (and false
positives, should I ever see any) also get trained, *if* the training
between the time they are incorrectly scored and the time I notice them
doesn't push them into spam territory. I use procmail to sort my incoming
mail into 20 or so mailboxes. sb_filter.py is executed very early on, with
messages that score as spam or unsure siphoned off to relevant mailboxes.
The stuff which scores as ham is then further sorted topically.
Consequently, false negatives can go unnoticed for a while, since they might
be scattered all over the place. I focus most of my training attention on
my unsure mailbox. When I see a false negative I save it in my unsure
mailbox and deal with it the next time I work on that.
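As a rough illustration of the routing step described above, here is a minimal Python sketch. It is not the procmail setup itself. It assumes sb_filter.py has already added an X-Spambayes-Classification header whose value begins with ham, spam, or unsure; the mailbox names, the route() helper, and the topical_sorter callback are hypothetical stand-ins for the procmail recipes.

```python
from email import message_from_string

# Hypothetical mailbox names; the setup in the post uses procmail recipes.
SIPHON_MAILBOXES = {"spam": "spam", "unsure": "unsure"}


def route(raw_message, topical_sorter=lambda msg: "inbox"):
    """Return the mailbox name a message should be delivered to.

    Assumes the message was already scored and carries an
    X-Spambayes-Classification header such as "spam" or "unsure; 0.45".
    A missing header is treated like ham (an assumption of this sketch).
    """
    msg = message_from_string(raw_message)
    header = msg.get("X-Spambayes-Classification", "")
    # Keep only the label; an optional score may follow a semicolon.
    label = header.split(";")[0].strip().lower()
    if label in SIPHON_MAILBOXES:
        # Spam and unsure are siphoned off before any topical sorting.
        return SIPHON_MAILBOXES[label]
    # Everything that scored as ham goes on to topical sorting.
    return topical_sorter(msg)
```

Because ham is scattered across the topical mailboxes, a false negative can land in any of them, which is why it may go unnoticed.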
Furthermore, since almost all mistakes and unsures are actually spam, I have
to keep an eye on the spam/ham balance in my database, so I occasionally stuff
a correctly scored ham into the database. I try to choose messages which
don't score 0.00 (rounded).
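The balancing habit above can be sketched roughly as follows. This is only an illustration: the function name, the (msg_id, score) input format, and the max_ratio threshold are assumptions, since the post gives no figure for when the balance counts as off.

```python
def pick_balancing_ham(scored_ham, nham, nspam, max_ratio=2.0):
    """Return ids of correctly scored ham worth training on to restore balance.

    scored_ham: iterable of (msg_id, score) pairs for mail already scored as ham.
    nham, nspam: counts of ham and spam currently trained into the database.
    max_ratio: hypothetical threshold for an acceptable spam/ham ratio.
    """
    if nspam <= max_ratio * nham:
        return []  # balance looks acceptable, nothing extra to train
    # Skip messages that score 0.00 after rounding to two places.
    return [msg_id for msg_id, score in scored_ham if round(score, 2) > 0.0]
```

For example, with 10 ham and 50 spam trained, pick_balancing_ham([("a", 0.001), ("b", 0.03)], 10, 50) returns ["b"], because "a" rounds to 0.00.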
Seth> - train on errors + unsures
This is more or less what I do; it's just that since my focus is on unsures,
some of the errors may go away before I get a chance to train on them.
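To make "some of the errors may go away" concrete, here is a minimal sketch of choosing what to train on at review time. The cutoffs of 0.20 and 0.90 are assumed values, and classify(), training_queue(), and the (msg_id, true_label, current_score) tuples are hypothetical; the point is only that a message is queued if its current score still disagrees with its true label.

```python
HAM_CUTOFF = 0.20   # assumed: scores below this count as ham
SPAM_CUTOFF = 0.90  # assumed: scores at or above this count as spam


def classify(score):
    """Map a score in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score < SPAM_CUTOFF:
        return "unsure"
    return "spam"


def training_queue(reviewed):
    """Return (score, msg_id, true_label) tuples still worth training on.

    reviewed: iterable of (msg_id, true_label, current_score), where
    current_score is the score when the message is reviewed, not the
    score it had on arrival. Unsures and still-wrong messages are kept;
    earlier errors that intervening training already fixed drop out.
    The result is sorted lowest score first.
    """
    queue = [
        (score, msg_id, label)
        for msg_id, label, score in reviewed
        if classify(score) != label
    ]
    return sorted(queue)
```

Sorting lowest score first mirrors the sort-by-score step in the quoted description; how many messages to train per pass is left open here.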
Seth> - train on errors + unsures + non-obvious correct decisions
What do you mean by 'non-obvious correct decisions'?