[Spambayes] More web interface statistics

David Abrahams dave at boost-consulting.com
Fri Apr 27 13:58:30 CEST 2007


on Fri Apr 27 2007, David Abrahams <dave-AT-boost-consulting.com> wrote:

> OK, this is really weird.  I have reasonably balanced spam and ham
> folders (within about 15 messages of one another). I just used tte.py
> to train them, without any ratio option, so it should have actually
> been a balanced number of messages.  Yet, when I run sb_imapfilter.py,
> I see:
>
> $ sb_imapfilter.py -v
> Loading state from /home/dave/spambayes/hammie.fs database
>
> /home/dave/spambayes/hammie.fs is an existing database, 
> with 282 spam and 76 ham
>      ^^^^^^^^^^^^^^^^^^^
>                                 
> What do those ham/spam numbers really mean?

I have at least part of an answer:

$ tte.py ...
Loading state from /home/dave/spambayes/hammie.new.fs database
/home/dave/spambayes/hammie.new.fs is a new database
round:  1, msgs:  822, ham misses:  68, spam misses: 222, 73.4s
round:  2, msgs:  822, ham misses:   8, spam misses:  56, 24.5s
round:  3, msgs:  822, ham misses:   0, spam misses:   4, 20.6s
round:  4, msgs:  822, ham misses:   0, spam misses:   0, 19.7s
************************************16 untrained spams

68+8 = 76
222+56+4 = 282
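
As a quick check of that arithmetic, here is a minimal sketch that sums the per-round miss counts from the tte.py output pasted above. The regular expression and the helper name `total_misses` are my own for illustration and are not part of SpamBayes.

```python
import re

# The four "round:" lines copied verbatim from the tte.py run above.
TTE_OUTPUT = """\
round:  1, msgs:  822, ham misses:  68, spam misses: 222, 73.4s
round:  2, msgs:  822, ham misses:   8, spam misses:  56, 24.5s
round:  3, msgs:  822, ham misses:   0, spam misses:   4, 20.6s
round:  4, msgs:  822, ham misses:   0, spam misses:   0, 19.7s
"""

# Hypothetical pattern matching the two miss counts on each round line.
ROUND_RE = re.compile(r"ham misses:\s*(\d+), spam misses:\s*(\d+)")


def total_misses(text):
    """Return (ham misses, spam misses) summed over all rounds in text."""
    ham = spam = 0
    for match in ROUND_RE.finditer(text):
        ham += int(match.group(1))
        spam += int(match.group(2))
    return ham, spam


ham_total, spam_total = total_misses(TTE_OUTPUT)
print(ham_total, spam_total)  # prints "76 282"
```

The totals match the "282 spam and 76 ham" that sb_imapfilter.py reports for the database.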

So, somehow, the number of hams or spams "in the database" really
reflects the number of messages that were misclassified during
training, and that were therefore actually added to the training data?
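
To illustrate that guess, here is a toy sketch of a train-on-error loop. It is not tte.py's actual code: `ToyClassifier`, its scoring rule, the 0.5 cutoff, `train_on_errors` and the sample messages are all hypothetical. It only shows that if a message is trained solely when it is misclassified, the ham and spam counts stored in the database equal the summed per-round misses, however balanced the input folders are.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for the database; NOT the SpamBayes classifier."""

    def __init__(self):
        self.nham = 0    # number of ham messages trained
        self.nspam = 0   # number of spam messages trained
        self.ham_tokens = Counter()
        self.spam_tokens = Counter()

    def train(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
            self.spam_tokens.update(set(tokens))
        else:
            self.nham += 1
            self.ham_tokens.update(set(tokens))

    def score(self, tokens):
        """Mean spam fraction over known tokens; 0.5 if nothing is known."""
        probs = []
        for tok in set(tokens):
            s, h = self.spam_tokens[tok], self.ham_tokens[tok]
            if s + h:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def train_on_errors(clf, messages, max_rounds=20):
    """Train only on misclassified messages; return per-round miss counts."""
    history = []
    for _ in range(max_rounds):
        ham_misses = spam_misses = 0
        for tokens, is_spam in messages:
            score = clf.score(tokens)
            # Assumed rule: a score of exactly 0.5 counts as a miss.
            wrong = score <= 0.5 if is_spam else score >= 0.5
            if wrong:
                clf.train(tokens, is_spam)
                if is_spam:
                    spam_misses += 1
                else:
                    ham_misses += 1
        history.append((ham_misses, spam_misses))
        if ham_misses == 0 and spam_misses == 0:
            break
    return history


# Made-up sample messages as (tokens, is_spam) pairs.
messages = [
    (["cheap", "pills", "now"], True),
    (["cheap", "watches"], True),
    (["win", "money", "now"], True),
    (["meeting", "agenda"], False),
    (["patch", "review", "now"], False),
]

clf = ToyClassifier()
history = train_on_errors(clf, messages)
for number, (ham_misses, spam_misses) in enumerate(history, start=1):
    print(f"round: {number}, ham misses: {ham_misses}, spam misses: {spam_misses}")
print(f"database holds {clf.nspam} spam and {clf.nham} ham")
```

Under these assumptions the stored counts always equal the column sums of the per-round misses, which is the same relationship as 68+8 = 76 and 222+56+4 = 282 above.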

It's hard to understand the importance of keeping ham and spam
balanced if one of them can ultimately influence training so much
more than the other.

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com


