[spambayes-dev] RE: [Spambayes] How low can you go?

Mon Dec 29 15:37:17 EST 2003

[T. Alexander Popiel]
> ...
> Yup.  I have a nice picture now of the ratio over time at the bottom
> of the report at:
> http://www.wolfskeep.com/~popiel/spambayes/nonedge

Hmm.  That appears to be using a log scale for the Y (ratio) axis, so what
*appears* to be straight-line growth in the ratio after about day 150 is
really exponential growth.  That could get bad over time <wink>.
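
To make the log-scale point concrete, here is a minimal sketch, assuming the
ratio really does follow a straight line on a log Y axis.  Under that
assumption, two readings from the line are enough to estimate a doubling
time.  The function name and the sample readings are hypothetical; they are
not taken from the report.

```python
import math

def doubling_time(day0, ratio0, day1, ratio1):
    """Days for the ratio to double, assuming log(ratio) is linear in days."""
    # Slope of the straight line on the log-scale plot.
    slope = (math.log(ratio1) - math.log(ratio0)) / (day1 - day0)
    # A constant slope in log space means a constant doubling period.
    return math.log(2) / slope

# Hypothetical readings: ratio 2.0 at day 150 and 4.0 at day 250.
print(doubling_time(150, 2.0, 250, 4.0))
```

A straight line on a linear axis would add a fixed amount per day; on a log
axis it multiplies by a fixed factor per day, which is why it can get bad.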

> ...
> Interestingly enough, though, the nonedge did better than TOE, despite
> a worse imbalance.

Yup, I saw that.

>> So if I had your data, I'd be curious to try variations that force
>> better balance.

> I'd love to... but I haven't been able to come up with anything which
> maintains the balance better without extreme artificiality.  If you
> think of any regimes that make sense, I'd be more than happy to run
> them.

Oh, there are billions of things that could be tried.  Who knows what might
pay?  Picking just enough edge ham at random for training to force balance
is one idea.  The definition of "nonedge" is arbitrarily mutable too:
there's nothing a priori compelling about "0.00 or 1.00 after rounding to 2
decimal digits after the radix point".  For example, maybe it's better to
use 3 decimal digits, or 1, or maybe it's really best to use 2 digits after
the radix point when the score is expressed in base 7 <wink -- but "two
decimal digits" is just an artifact of how scores get displayed>.
Asymmetric bounds also have some attraction, since, e.g., in mistake-based
training "by hand" I always end up moving the ham cutoff closer to 0 than
the spam cutoff is to 1.  IOW, empirically, in my own email mix, and based
on one kind of lazy training, my region of certainty for ham is smaller than
my region of certainty for spam.  This makes some sense to me, since my ham
is more uniform than my spam.
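
The variations above could be sketched roughly as follows.  This is not code
from the spambayes test harness: every function name, parameter, and default
value here is a hypothetical chosen for illustration, and the asymmetric
bounds in particular are made-up numbers.

```python
import random

def is_nonedge(score, digits=2):
    """True if the score does not round to 0 or 1 at `digits` decimal places."""
    return round(score, digits) not in (0.0, 1.0)

def is_nonedge_asymmetric(score, ham_edge=0.005, spam_edge=0.98):
    """True if the score lies between two independently chosen bounds.

    The defaults are hypothetical; they put the ham bound closer to 0 than
    the spam bound is to 1.
    """
    return ham_edge <= score <= spam_edge

def pick_balancing_ham(edge_ham, n_ham_trained, n_spam_trained, rng=random):
    """Randomly pick just enough edge ham to match the trained spam count."""
    shortfall = max(0, n_spam_trained - n_ham_trained)
    # Never ask for more messages than are available.
    return rng.sample(edge_ham, min(shortfall, len(edge_ham)))

print(is_nonedge(0.004))            # False: rounds to 0.00
print(is_nonedge(0.004, digits=3))  # True: 0.004 is not 0.000
print(is_nonedge_asymmetric(0.99))  # False: above the assumed spam bound
```

Changing `digits` or the two bounds changes which messages count as
"nonedge", and so changes how much gets trained on each side.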

Heh.  Except at Christmas, and probably through the first week of next year,
when I get piles of msgs from people I only hear from once a year.