[spambayes-dev] artificially tweaking the spam/ham to deal with n-way scoring

Skip Montanaro skip at pobox.com
Fri Sep 26 12:06:40 EDT 2003


I'm horsing around with my n-way script a bit and have written a small
training script to help maintain the many training databases.  Here's how it
works.  Suppose I have four input mailboxes: python, personal, cars and
music, containing 450, 50, 1400 and 50 messages, respectively.  I run
sb_mboxtrain.py over each one, calling all messages "good", e.g.:

    sb_mboxtrain.py -d python-ham.db -g python
    sb_mboxtrain.py -d personal-ham.db -g personal
    sb_mboxtrain.py -d cars-ham.db -g cars
    sb_mboxtrain.py -d music-ham.db -g music

I then create python.db by initializing it with the contents of
python-ham.db with all the counts swapped (treating python-ham.db as all
"spam"), then merging in the other three unchanged.  I build personal.db,
cars.db and music.db the same way.  When I'm finished, I have these message
counts in the four databases:

    db            spam     ham
    python.db      450    1500
    personal.db     50    1900
    cars.db       1400     550
    music.db        50    1900
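
The swap-and-merge step can be sketched roughly as follows.  This is only an
illustration: plain dicts stand in for the real training databases (it is not
the SpamBayes storage format), build_target_db is a hypothetical helper, and
the per-word counts are made up.  Only the message totals come from the
figures above.

```python
def build_target_db(sources, target):
    """Merge per-mailbox "all ham" databases into one n-way database.

    sources maps a mailbox name to {"nham": messages, "words": {token: count}}.
    The target mailbox's counts are swapped into the spam column; every
    other mailbox is merged in unchanged, as ham.
    """
    db = {"nspam": 0, "nham": 0, "words": {}}
    for name, src in sources.items():
        swap = (name == target)
        if swap:
            db["nspam"] += src["nham"]
        else:
            db["nham"] += src["nham"]
        for token, count in src["words"].items():
            spam, ham = db["words"].get(token, (0, 0))
            if swap:
                spam += count
            else:
                ham += count
            db["words"][token] = (spam, ham)
    return db


# Message totals from the example above; word counts are invented.
sources = {
    "python": {"nham": 450, "words": {"import": 300, "engine": 5}},
    "personal": {"nham": 50, "words": {"dinner": 20}},
    "cars": {"nham": 1400, "words": {"engine": 900}},
    "music": {"nham": 50, "words": {"guitar": 30, "engine": 1}},
}

for name in sources:
    db = build_target_db(sources, name)
    print(f"{name}.db  spam={db['nspam']}  ham={db['nham']}")
```

Each mailbox is still trained only once; the per-target databases are derived
purely by re-merging the four "all ham" sources.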

My nway script then scores messages against those four databases.

Because of the way I build the databases, the ham:spam ratios can be very
extreme.  In real life I want to cluster with more mailboxes (I have 15 at
the moment).  I haven't thought of a good way to truly balance the counts
while still running sb_mboxtrain only once against each mbox file, so I'm
thinking I should fudge things.  When I merge the multiple smaller databases
into a single database, I would scale the larger counts by enough to satisfy
this relationship:

    0.5 <= ham count/spam count <= 2.0

Using the python database as an example, I would scale all ham counts by
900/1500, giving a 2-to-1 ham:spam ratio.
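
As a rough sketch of that arithmetic, assuming the scaling is applied
uniformly to whichever side is too large (balance_scales and its max_ratio
parameter are hypothetical names, not part of SpamBayes):

```python
def balance_scales(nspam, nham, max_ratio=2.0):
    """Return (spam_scale, ham_scale) so that
    1/max_ratio <= ham/spam <= max_ratio after scaling.

    Only the larger side is scaled down; the smaller side keeps a
    scale factor of 1.0.
    """
    if nspam <= 0 or nham <= 0:
        raise ValueError("both counts must be positive")
    ratio = nham / nspam
    if ratio > max_ratio:
        # Too much ham: shrink ham counts to max_ratio * nspam.
        return 1.0, (max_ratio * nspam) / nham
    if ratio < 1.0 / max_ratio:
        # Too much spam: shrink spam counts to max_ratio * nham.
        return (max_ratio * nham) / nspam, 1.0
    return 1.0, 1.0


# python.db: 450 spam, 1500 ham -> ham scaled by 900/1500
spam_scale, ham_scale = balance_scales(450, 1500)
print(spam_scale, ham_scale)

# cars.db: 1400 spam, 550 ham -> spam scaled by 1100/1400
print(balance_scales(1400, 550))
```

The same factor would then be applied to every per-word ham (or spam) count
in the merged database, not just to the message totals.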

Is that roughly what the now-defunct imbalance ratio did?

Skip


