[spambayes-dev] artificially tweaking the spam/ham to deal with
n-way scoring
Skip Montanaro
skip at pobox.com
Fri Sep 26 12:06:40 EDT 2003
I'm horsing around with my n-way script a bit and have written a small
training script to help maintain the many training databases. Here's how it
works. Suppose I have four input mailboxes: python, personal, cars and
music, each with a different number of messages (450, 50, 1400, and 50
messages, respectively). I run sb_mboxtrain.py over each one calling all
messages "good", e.g.:
    sb_mboxtrain.py -d python-ham.db -g python
    sb_mboxtrain.py -d personal-ham.db -g personal
    sb_mboxtrain.py -d cars-ham.db -g cars
    sb_mboxtrain.py -d music-ham.db -g music
I then create python.db by initializing it with the contents of
python-ham.db with the ham and spam counts swapped (so everything in
python-ham.db is treated as "spam"), then merging in the other three
databases unchanged. I build personal.db, cars.db and music.db the same
way. When I'm finished, I have these figures in the four databases:
    db            spam    ham
    python.db      450   1500
    personal.db     50   1900
    cars.db       1400    550
    music.db        50   1900
My nway script then scores messages against those four databases.
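The swap-and-merge step can be sketched roughly as below. This is only an illustration: the dict layout (`nham`, `tokens`) and the helper name `build_one_vs_rest` are hypothetical stand-ins, not the real SpamBayes database format or API, and the token counts are made up. Only the message counts come from the example above.

```python
from collections import Counter

def build_one_vs_rest(ham_dbs, target):
    """Build a one-vs-rest database for `target` (hypothetical layout).

    The target's own "ham" counts are swapped into the spam side;
    every other mailbox is merged in unchanged as ham.
    """
    own = ham_dbs[target]
    merged = {
        "nspam": own["nham"],                    # swapped: own ham becomes spam
        "nham": 0,
        "spam_tokens": Counter(own["tokens"]),   # swapped token counts
        "ham_tokens": Counter(),
    }
    for name, db in ham_dbs.items():
        if name == target:
            continue
        merged["nham"] += db["nham"]
        merged["ham_tokens"].update(db["tokens"])  # Counter.update adds counts
    return merged

# Message counts are from the example; token counts are invented.
ham_dbs = {
    "python":   {"nham": 450,  "tokens": Counter({"import": 300, "engine": 5})},
    "personal": {"nham": 50,   "tokens": Counter({"dinner": 20})},
    "cars":     {"nham": 1400, "tokens": Counter({"engine": 900})},
    "music":    {"nham": 50,   "tokens": Counter({"guitar": 30})},
}

for name in ham_dbs:
    db = build_one_vs_rest(ham_dbs, name)
    print(f"{name + '.db':<12} {db['nspam']:>5} {db['nham']:>5}")
```

Each mailbox is still trained only once; the per-category databases are derived purely by combining the already-trained counts.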
As a result of the way I'm building the databases, I can end up with very
lopsided ham:spam ratios (1900:50 for personal.db and music.db above).
Again, in real life I want to cluster with more mailboxes (I have 15 at the
moment). I haven't thought of a good way to truly balance the scores and
still run sb_mboxtrain.py only once against each mbox file, so I'm thinking
I should fudge things. When I merge the multiple smaller databases into a
single database, I would try scaling the larger values down by enough to
satisfy this relationship:
    0.5 <= ham count/spam count <= 2.0
Using the python database as an example, I would scale all ham counts by
900/1500 (0.6), taking the ham total from 1500 down to 900 against 450
spam, for a 2-to-1 ham:spam ratio.
Is that roughly what the now-defunct imbalance ratio did?
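A minimal sketch of the scaling rule described above, assuming the larger side is always the one scaled down and that per-token counts get the same factor as the message totals. The names `balance_factors` and `scale_counts` are hypothetical, and rounding scaled token counts to integers is my own assumption, not something the real databases require.

```python
from fractions import Fraction

def balance_factors(nspam, nham, max_ratio=2):
    """Return (spam_factor, ham_factor) so that after scaling,
    1/max_ratio <= ham/spam <= max_ratio. Only the larger side shrinks."""
    if nspam <= 0 or nham <= 0:
        raise ValueError("both counts must be positive")
    if nham > max_ratio * nspam:
        # Too much ham: shrink ham so ham == max_ratio * spam.
        return Fraction(1), Fraction(max_ratio * nspam, nham)
    if nspam > max_ratio * nham:
        # Too much spam: shrink spam so spam == max_ratio * ham.
        return Fraction(max_ratio * nham, nspam), Fraction(1)
    return Fraction(1), Fraction(1)  # already within bounds

def scale_counts(tokens, factor):
    """Scale per-token counts by `factor`, rounding to whole counts
    (rounding is an assumption made for this sketch)."""
    return {tok: round(count * factor) for tok, count in tokens.items()}

# python.db from the example: 450 spam, 1500 ham
spam_f, ham_f = balance_factors(450, 1500)
print(ham_f, 1500 * ham_f)    # factor 900/1500 reduces to 3/5

# cars.db: 1400 spam, 550 ham, so the spam side is scaled instead
spam_f, ham_f = balance_factors(1400, 550)
print(spam_f, 1400 * spam_f)
```

Exact fractions are used only to keep the arithmetic easy to check; plain floats would do just as well in practice. Note that rounding can push rare tokens down to a count of zero, which a real implementation would have to decide how to handle.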
Skip