[spambayes-dev] RE: [Spambayes] How low can you go?
skip at pobox.com
Thu Dec 18 08:41:11 EST 2003
Tim> I'm experimenting with a mixed unigram/bigram classifier right now.
Tim> It's been trained on (just) 94 ham and 96 spam so far, but there
Tim> are already 51,378 features in the database. 45,624 of them are
Tim> hapaxes -- that's 89%!
Late yesterday afternoon I tweaked my procmailrc file to automatically train
on everything which scored as ham or spam. I awoke this morning to a
database with 489 spam, 600 ham and 198,747 features, 158,116 of which were
hapaxes (80%). At the same time I moved my ham/spam thresholds closer to 0
and 1 to minimize the amount of retraining necessary to counteract false
positives and false negatives. (It's kind of a pain because I'm also saving
the messages I train on, so I have to rummage around in a Unix mbox to find
incorrectly trained messages.) I train unsures by hand. Still only 16
unsures overnight, but my database is up to 10.5MB, so training and scoring
time is on the rise.
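
The hapax percentages quoted above are easy to check. The following is only a
sketch: hapax_fraction is a hypothetical helper, not part of SpamBayes, and it
assumes the database can be viewed as a plain mapping from token to a
(spamcount, hamcount) pair, with a hapax being a token seen in exactly one
trained message.

```python
def hapax_fraction(counts):
    """Return (num_hapaxes, num_features, fraction).

    counts is an assumed layout: token -> (spamcount, hamcount).
    """
    total = len(counts)
    # A hapax has been seen in exactly one trained message.
    hapaxes = sum(1 for spam, ham in counts.values() if spam + ham == 1)
    return hapaxes, total, (hapaxes / total if total else 0.0)


# Toy database: two of the four tokens were seen only once.
toy = {"viagra": (3, 0), "python": (0, 5), "xyzzy": (1, 0), "plugh": (0, 1)}
print(hapax_fraction(toy))

# The figures reported in this thread.
print(round(100 * 45624 / 51378))    # Tim's database: 89
print(round(100 * 158116 / 198747))  # the overnight database: 80
```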
Bringing it back to this topic, hapax expiration seems like a worthwhile
step on space/time grounds, and it is even less likely to cause problems
in my case because I'm training on everything I see.
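
To make the idea concrete, here is one way hapax expiration might look. This
is a sketch under stated assumptions, not how SpamBayes does it: the function
expire_hapaxes, the max_age_days parameter and the per-token last_trained
timestamp are all hypothetical, and the database is again modelled as a plain
dict.

```python
import time


def expire_hapaxes(db, max_age_days, now=None):
    """Delete stale hapaxes from db in place; return how many were removed.

    db is an assumed layout: token -> (spamcount, hamcount, last_trained),
    where last_trained is a timestamp in seconds since the epoch.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    # Collect first, then delete, so the dict is not mutated while iterating.
    stale = [tok for tok, (spam, ham, last) in db.items()
             if spam + ham == 1 and last < cutoff]
    for tok in stale:
        del db[tok]
    return len(stale)


day = 86400
db = {
    "old_hapax": (1, 0, 0),         # seen once, long ago: expired
    "new_hapax": (0, 1, 99 * day),  # seen once, recently: kept
    "common": (5, 2, 0),            # seen often: kept regardless of age
}
removed = expire_hapaxes(db, max_age_days=30, now=100 * day)
print(removed, sorted(db))
```

Keeping recent hapaxes gives a token that turns up again a chance to
accumulate counts before it is thrown away.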
Now if I could only test this setup easily without a huge time investment.
Perhaps a few more Emacs keybindings are in order.
Tim> BTW, the single worst thing you can do with a system of this type
Tim> is train a message into the wrong category. Everyone does it
Tim> eventually, and some people can't seem to help doing it often.