[Spambayes] RE: spam detection via probability - actual results!

Tim Peters tim.one@comcast.net
Sat, 21 Sep 2002 17:29:20 -0400


[Skip Montanaro]
> I was thinking about this yesterday when I ran rebal because to
> switch from rebalancing ham to rebalancing spam I hit Ctl-P then
> had to edit two places in the line.  A bit late for Tim and
> Anthony, but perhaps the change I just checked in will help
> others avoid this problem.

Oddly enough, it wouldn't have helped me.  It's hard to reconstruct just how
screwed up I got <wink>.

Note that I've checked in another change:  -Q no longer implies -q.  That's
the part that would have helped me, across several steps of screwing up.  I
didn't want to confirm every move, but I had not said -q so the lack of any
output convinced me it wasn't doing anything.  So I kept doing it again and
again, fiddling the parameters each time in what turned out to be disastrous
ways.  The change you made probably would have prevented at least one of the
last screwups.

Heh:  as part of "fixing" this, I managed to move a whole bunch of ham into
my spam *reservoir* too.  So I'd clean out all the ham that a test found in
the spam Sets, run rebal again, and time after time a new run kept finding
more new ham in the spam, but an ever-decreasing amount.  I thought that
this was just because it was getting trained better; it was really because
my rebals kept putting more ham into the spam.

In the end, I took Neil's suggestion one step further:  I stopped running
tests at all, and just did a grep for python-list over the spam sets.  That
caught 100% of the misclassified ham instantly.
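
For anyone without grep handy, the same check can be sketched in Python.
This is only an illustration, not code from the project: the function
name files_mentioning and the Data/Spam path in the usage note are
assumptions, and it does a plain case-sensitive byte search, as grep
does by default.

```python
import os


def files_mentioning(root, needle=b"python-list"):
    """Return the sorted paths under root whose raw bytes contain needle.

    Hypothetical helper: a case-sensitive substring search over every
    file below root, roughly what a recursive grep that lists matching
    files would report.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Read as bytes so odd encodings in spam can't raise errors.
            with open(path, "rb") as f:
                if needle in f.read():
                    hits.append(path)
    return sorted(hits)
```

Calling files_mentioning("Data/Spam") (path assumed) would list every
message in the spam sets that mentions python-list; each hit is a
candidate for misfiled ham and still deserves a look before moving.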

So mixed-source corpora aren't entirely a bad thing either <wink>.