[spambayes-dev] RE: [Spambayes] question regarding training

Fri Aug 13 07:17:47 CEST 2004

In message:  <ECBA357DDED63B4995F5C1F5CBE5B1E86C5492 at its-xchg4.massey.ac.nz>
             "Tony Meyer" <tameyer at ihug.co.nz> writes:

>True, train-on-mistakes might not reduce the imbalance compared to
>train-on-everything.  This would only be true if the percentage of
>mistakes that are spam is lower than the percentage of incoming mail
>that is spam.  I should really have used "might" instead of "should"
>there.  In some cases, it will, however.
>
>The imbalance almost certainly will grow less quickly, though, because
>the database size will grow much, much slower.

I don't believe this assertion.  Sure, the raw counts of the trained
ham and spam will grow more slowly, but the relative imbalance will
(based on observation of my mail data) grow even more rapidly in
reduced-training regimes.

I use TOAE (train-on-almost-everything: all spam scoring below .995 and
all ham scoring above .005), not TOM, but I can say that the imbalance
in my training set is significantly higher than the imbalance in my
incoming mail.
Specifically, in the retrain from last night (using the last 4 months
of my incoming mail):

  Total:    3367 ham, 33851 spam (90.95% spam)
  Trained:   225 ham, 17421 spam (98.72% spam)
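
As a rough sketch, the TOAE rule described above could be written as
follows.  The function name and arguments are hypothetical, not part of
the SpamBayes API, and scores are assumed to be spam probabilities
between 0 and 1:

```python
def should_train_toae(score, is_spam, spam_limit=0.995, ham_limit=0.005):
    """Return True if a message should be trained on under TOAE.

    Hypothetical helper, not part of SpamBayes.  `score` is assumed to
    be the classifier's spam probability (0.0 to 1.0) and `is_spam` the
    true label supplied by the user.
    """
    if is_spam:
        # Skip only spam the classifier is already nearly certain about.
        return score < spam_limit
    # Skip only ham the classifier is already nearly certain about.
    return score > ham_limit
```

Under this rule a spam scoring .999 is skipped, while a spam scoring
.90 is trained even though it may already have been classified
correctly.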

Also, I have in that time period:

  Unsure:     83 ham,  8843 spam (99.07% spam)
  Errors:      2  fp,   872  fn  (99.77% spam)

This shows that if I were training only on errors, my imbalance would
be even worse than it currently is.  (It also shows that I really need
to tune my cutoffs; they're currently at the defaults.)
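
The percentages above, and Tony's condition for train-on-mistakes
reducing the imbalance, can be checked with a few lines of Python.  This
is only a sketch; the helper and variable names are made up for
illustration, and the counts are copied from the figures above:

```python
def spam_percent(ham, spam):
    """Percentage of messages in a (ham, spam) pair that are spam."""
    return 100.0 * spam / (ham + spam)

# (ham, spam) counts copied from the figures in this message.
counts = {
    "Total": (3367, 33851),
    "Trained": (225, 17421),
    "Unsure": (83, 8843),
    "Errors": (2, 872),
}

percents = {name: spam_percent(h, s) for name, (h, s) in counts.items()}

for name, pct in percents.items():
    print(f"{name:8s} {pct:6.2f}% spam")

# Tony's condition: training on mistakes reduces the imbalance only if
# the mistakes are less spam-heavy than the incoming mail as a whole.
tom_reduces_imbalance = percents["Errors"] < percents["Total"]
print("TOM would reduce imbalance:", tom_reduces_imbalance)
```

With these counts the comparison comes out False: the errors are more
spam-heavy than the incoming mail, so the condition is not met for this
data.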

I have in the past suggested that the ideal imbalance is related to
the number of distinct 'topics' in each category.  For instance,
there are only about four topics in my ham: spambayes discussions,
pennmush discussions, administrative mails, and idle chatter from my
friends.  On the other hand, there are many more topics in my spam:
delivery errors caused by virus joe-jobs, sexual enhancement, mortgage
loans, weight reduction, Nigerian-style scams, must-have lawn ornaments,
stock pick of the scammer, chain letters, this-is-not-a-marketing-pyramid,
etc.  Unfortunately, I don't have enough AI knowledge to rigorously
categorize these and quantify the relationship between topics and
training imbalance.

- Alex