[Spambayes] Frequency distribution for wordinfo counts?

Skip Montanaro skip at pobox.com
Mon Feb 23 20:04:40 EST 2004

    >> I'm coming into this late, but thought I'd post my numbers.  As far
    >> as I know, I'm the only person using train-to-exhaustion at the
    >> moment.  That probably skews my numbers, so maybe they'll be of
    >> interest.

    Tony> How well is this working for you?  Is it really slow?  Do you have
    Tony> it set to only use a subset of mail, or is it
    Tony> training-to-exhaustion on the whole lot?

Seems to work pretty well.  Here's a run I did just now:

    % python ~/tmp/spambayes/contrib/tte.py -g newham.clean.save -s newspam.clean.save -d tte.db
    round:  1, msgs:  770, ham misses: 196, spam misses: 244, 67.7s
    round:  2, msgs:  770, ham misses:  33, spam misses:  55, 49.4s
    round:  3, msgs:  770, ham misses:   8, spam misses:   5, 33.1s
    round:  4, msgs:  770, ham misses:   0, spam misses:   0, 28.6s
    1 untrained spams

Adding up the last column indicates a total run time of about three minutes.
I can live with that.
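
For anyone who wants to check the arithmetic, here is a small sketch that sums the per-round times from the output above. The regular expression assumes every round line ends with a time such as "67.7s"; that matches the lines shown here but is not a documented tte.py output format.

```python
import re

# Round lines copied verbatim from the tte.py run above.
output = """\
round:  1, msgs:  770, ham misses: 196, spam misses: 244, 67.7s
round:  2, msgs:  770, ham misses:  33, spam misses:  55, 49.4s
round:  3, msgs:  770, ham misses:   8, spam misses:   5, 33.1s
round:  4, msgs:  770, ham misses:   0, spam misses:   0, 28.6s
"""

# The last column is the elapsed time for each round, e.g. "67.7s".
times = [float(t) for t in re.findall(r"([\d.]+)s$", output, flags=re.MULTILINE)]

total = sum(times)
print(f"{len(times)} rounds, {total:.1f}s total, {total / 60:.1f} minutes")
```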

Note that even though I fed it 770 messages, only 541 messages (some of them
were duplicates) actually contributed to the final database:

    % spamcounts -d tte.db 'saved state'
    db: tte.db
    token,nspam,nham,spam prob
    saved state,304,237,0.5

The database thus winds up smaller than it would be with a more usual
training approach.

