[Spambayes] training problem?
sethg at GoodmanAssociates.com
Tue Dec 2 23:45:24 EST 2003
> What do you mean by false negative? We use it here to mean spam scoring
> below your ham cutoff.
That's exactly what I meant by it. I don't count an unsure as a false
negative, and I don't mind seeing unsures. Most of the false negatives were
spam that scored 0% or 1%. Incidentally, *all* of my ham scores either 0%
or 1% with the great preponderance at 0%. That is why I later moved my ham
threshold from 15 down to 5.
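The cutoff behavior at work here can be sketched in a few lines (the function name and the 0-100 percentage scale are illustrative, not the actual SpamBayes API):

```python
def categorize(score, ham_cutoff=5, spam_cutoff=90):
    """Map a spam score (0-100) to a category.

    Lowering ham_cutoff from 15 to 5 narrows the "ham" band, so
    messages scoring between 5 and 15 land in "unsure" instead of
    going straight to the Inbox.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

# A message scoring 10 was ham under the 15 cutoff, unsure under 5:
print(categorize(10))                   # -> "unsure"
print(categorize(10, ham_cutoff=15))    # -> "ham"
```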
> > 1) Initial training set 650 spam, 654 ham on 11-16-03.
> > 2) Initial filter thresholds 90/15.
> So by "false negative" here you mean spam scoring below 15? If so, I have
> no theory, as I see maybe one of those per month (with about 700
> emails per
> day, including 200-250 daily spam).
Exactly. At the outset of this experiment, a false negative was any spam
scoring below 15. Most of them scored very close to 0, just like my ham.
From the numbers in my results table, you can see that I get between 5 and
10 of these per day with a spam load of around 140 per day.
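For scale, the miss rate implied by those numbers is easy to work out (a throwaway sketch using the figures from this thread):

```python
def fn_rate(misses, daily_spam):
    """Percentage of daily spam that slips past the filter."""
    return misses / daily_spam * 100

# 5-10 misses against roughly 140 spam per day:
for misses in (5, 10):
    print(f"{misses}/140 -> {fn_rate(misses, 140):.1f}%")
# -> roughly 3.6% to 7.1%, versus the ~1 per month reported above
```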
> > The two messages I posted about in this thread were just examples.
> One would have sufficed <wink>.
Point well taken. Sorry for wasting everyone's bandwidth.
> > 3) Train on any spam that scores below 50, any ham that scores above
> > 15. Filter all unread mail after each training event to simulate
> If your spam cutoff is 90, why do you only train on spam scoring below 50?
> Something doesn't sound right here.
Yes, I agree it sounds fishy, but read on. My spam cutoff is 90, and I did
decide to only train on spam that scored less than 50 for this run. In
previous runs, I trained on all errors and all unsures using the default
thresholds of 90/15. All the unsures were spam (every single one of them),
and the unsures were numerous, so my spam database grew quickly. I trained
extra ham periodically to balance it, but I still had a high false negative
rate. This run was an experiment in training less (to see if that was the
problem) by only training when the classifier was *really* wrong as opposed
to training on all the unsures. I picked 50% as the cutoff for being really
wrong since a completely "neutral" set of words would have an expected score
of 50%. In any case, the experiment was a failure since my false negative
rate is about the same as it was when I trained on all errors and all
unsures.
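For clarity, the train-only-when-really-wrong rule amounts to something like this (a sketch with made-up names, not the plug-in's actual interface):

```python
def should_train(score, is_spam, really_wrong_cutoff=50, ham_cutoff=15):
    """Decide whether to feed a scored message back to the trainer.

    Spam is trained only when the classifier was "really wrong",
    i.e. scored below 50, the expected score of a completely
    neutral set of words; ham is trained whenever it strays above
    the ham cutoff.
    """
    if is_spam:
        return score < really_wrong_cutoff
    return score > ham_cutoff

# A spam scoring 70 is unsure under 90/15 cutoffs but is NOT
# trained under this policy; a spam scoring 2 is:
print(should_train(70, is_spam=True))   # -> False
print(should_train(2, is_spam=True))    # -> True
```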
> Sorry, still don't know what you mean by false negative. If you meant the
> conventional "scored below 15" (your former ham cutoff), yet
> very, very few
> of them scored between 5 and 15, it must mean that almost all of
> your false
> negatives are scoring below 5. Is that what you mean?
Yes! It's sad but true. Most of these false negatives had the same scores
as my ham: 0% or 1%.
> Ditto. My own FP and FN rates are trivial (I'm genuinely surprised to see
> any spam in my Inbox, and shocked to see a ham in my Spam folder, using
> cutoffs of 20 and 80). My Unsure rate (scores between 20 and 80)
> is heading
> toward 5% -- but I don't care (I review all my spam anyway, and I'm on
> enough admin-type mailing lists that I get a ton of weird email -- I can't
> myself decide whether fully half the stuff in my Unsure folder is "really
> ham" or "really spam", and toss it untrained after mentally shrugging).
I would be delighted if my system performed like that. Like you, I also
don't care how many unsures I get. Since the system *says* it's unsure, I
*will* look at those messages. I didn't track the number, but I think
unsures amounted to 15-20% of my spam.
The 5-10 spam messages in my Inbox scoring at or near zero, however, do
bother me. A couple of them are "newsletters that won't quit" types, and I can
understand the classifier having trouble with them. They don't have any
sales jargon; they just don't honor unsubscribe requests. If I weren't
into experimenting with SpamBayes like this, I would just kill them off by
sender. However, some of the spam that scores near zero is the real
stomach-emptying stuff that I would have guessed had enough spammy words to
light up the magic light bulb very brightly. There's also the 419 stuff
that SpamBayes does not seem to catch, for whatever reason.
> Until we know what you meant by false negative, none. If you're calling spam
> that ends up Unsure "false negative", then reducing your spam
> cutoff should
> help. If you really are getting lots of spam scoring below 5, then that's
> something I've never heard of before (anyone?).
It looks like this is a case you've never seen before, which is not good
news. I can send any files that you care to see and will do any experiments
that you suggest. I have also retained the entire message stream for the
duration of this experiment.
My assumption is that my results have something to do with my training
tactics: the initial training set size, the thresholds that trigger me to
train, how far out of balance I let the databases get before I add more ham,
etc. Do you think my initial training set (around 650 of each) was too
large? The next time I start over, I plan to use thresholds of 80/5. Do
you recommend any particular initial training set size?
The other configuration stuff that may or may not matter is:
- SpamBayes Outlook Plug-In 0.81, clean install
- Outlook 2000 SP-3
- Windows 2000 Pro SP-4
- mail fetched from two POP3 servers every five minutes
- Outlook rules move all legit mailing list stuff out of the Inbox
- background mode set with start delay = 2.0 sec, delay between messages =
- only the Inbox is watched
Humans: personal replies to sethg [at] GoodmanAssociates [dot] com
Spambots: disregard the above