[spambayes-dev] artificially tweaking the spam/ham to deal with n-way scoring

Sun Sep 28 22:32:04 EDT 2003

[Skip]
> I'm horsing around with my n-way script a bit and have written a small
> training script to help maintain the many training databases.  Here's
> how it works.  Suppose I have four input mailboxes: python, personal,
> cars and music, each with a different number of messages (450, 50,
> 1400, and 50 messages, respectively).  I run sb_mboxtrain.py over
> each one calling all messages "good", e.g.:
>
>     sb_mboxtrain.py -d python-ham.db -g python
>     sb_mboxtrain.py -d personal-ham.db -g personal
>     sb_mboxtrain.py -d cars-ham.db -g cars
>     sb_mboxtrain.py -d music-ham.db -g music
>
> I then create python.db by initializing it with the contents of
> python-ham.db, but swap all the counts (treating python-ham.db as all
> "spam"), then merge in the other three unchanged.  When I'm finished,
> I have these figures in the four databases:
>
>     db          spam            ham
>     python.db    450            1500
>     personal.db   50            1900
>     cars.db     1400             550
>     music.db      50            1900
>
> My nway script then scores messages against those four databases.
>
> As a result of the way I'm building the databases, I can have very
> extreme ratios.  Again, in real life I want to cluster with more
> mailboxes (I have 15 at the moment).  I haven't thought of a good way
> to truly balance the scores and only run sb_mboxtrain once against
> each mbox file.  I'm thinking I should fudge things.  When I merge
> the multiple smaller databases into a single database I was thinking
> I would try scaling the larger values by enough to satisfy this
> relationship:
>
>     0.5 <= ham count/spam count <= 2.0
>
> Using the python database as an example, I would scale all ham counts
> by 900/1500, giving a 2-to-1 ham:spam ratio.
>
> Is that roughly what the now defunct imbalance ratio did?

Yup, but your way isn't as extreme.

In all cases the spamprob we use is a weighted average of 0.5 and a
by-counting spamprob guess.  The imbalance adjustment just reduced the
weight on the by-counting spamprob guess, to what it would have been if we
had an equal number of ham and spam msgs.
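
To make the weighting concrete, here is a minimal sketch, assuming a
Robinson-style weighted average.  The function names, the prior strength
s=0.45, and the example counts are illustrative assumptions, and
imbalance_adjusted_n() is only one plausible reading of the adjustment
described above, not a copy of the classifier code.

```python
def by_counting_prob(spamcount, hamcount, nspam, nham):
    """Spamprob guess from counts alone, normalized by corpus sizes."""
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    return spamratio / (spamratio + hamratio)


def weighted_spamprob(prob, n, s=0.45, x=0.5):
    """Weighted average of the prior x (weight s) and prob (weight n)."""
    return (s * x + n * prob) / (s + n)


def imbalance_adjusted_n(spamcount, hamcount, nspam, nham):
    """Shrink the weight n as if the larger corpus matched the smaller one."""
    spam2ham = min(nspam / nham, 1.0)
    ham2spam = min(nham / nspam, 1.0)
    return hamcount * spam2ham + spamcount * ham2spam


# Hypothetical word: seen in 45 of 450 "spam" and 30 of 1500 "ham" messages.
spamcount, hamcount, nspam, nham = 45, 30, 450, 1500

prob = by_counting_prob(spamcount, hamcount, nspam, nham)
raw_n = spamcount + hamcount
adj_n = imbalance_adjusted_n(spamcount, hamcount, nspam, nham)

print("by-counting guess:", prob)
print("raw weight:", raw_n, "->", weighted_spamprob(prob, raw_n))
print("adjusted weight:", adj_n, "->", weighted_spamprob(prob, adj_n))
```

In this example the by-counting guess is the same in both cases; only its
weight drops (from 75 to about 54), so the final spamprob moves slightly
toward 0.5.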

Your adjustment doesn't change the by-counting spamprob guess either, at
least not in principle.  In practice it does, slightly, due to quantization
error:  multiplying a count by 900/1500 = 0.6 usually won't yield an exact
integer, and you lose the fractional part when you store the result as an
integer.  Your adjustment also reduces the weight, but not as drastically.
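
The quantization effect can be illustrated with a small sketch.  It assumes
the by-counting guess is computed from per-corpus ratios, and the word
counts (3 and 7) are made up for the example; exact fractions are used so
that only the integer truncation introduces error.

```python
from fractions import Fraction


def by_counting_prob(spamcount, hamcount, nspam, nham):
    """Spamprob guess from counts alone, computed with exact fractions."""
    spamratio = Fraction(spamcount) / nspam
    hamratio = Fraction(hamcount) / nham
    return spamratio / (spamratio + hamratio)


# Hypothetical word: seen in 3 of 450 "spam" and 7 of 1500 "ham" messages.
spamcount, hamcount, nspam, nham = 3, 7, 450, 1500
scale = Fraction(900, 1500)  # the 0.6 factor from the python.db example

original = by_counting_prob(spamcount, hamcount, nspam, nham)

# Scaling with fractional counts kept: the guess is unchanged.
exact = by_counting_prob(spamcount, hamcount * scale, nspam, nham * scale)

# Scaling with counts stored as integers: 7 * 0.6 = 4.2 is truncated to 4.
truncated_hamcount = int(hamcount * scale)
truncated_nham = int(nham * scale)
truncated = by_counting_prob(spamcount, truncated_hamcount, nspam, truncated_nham)

print("original:", original)
print("exact scaling:", exact)
print("truncated scaling:", truncated)
print("weight n:", spamcount + hamcount, "->", spamcount + truncated_hamcount)
```

Here the guess stays at 10/17 when fractional counts are kept, shifts to 3/5
once the scaled ham count is truncated to an integer, and the weight falls
from 10 to 7.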