[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Tue, 24 Sep 2002 22:21:44 -0400


[Guido]
> Here's a first result.  I compared baseline Graham (empty .ini) with
> Gary according to your suggestion, but with spam_cutoff set to 0.625

Wow!  0.625 is the largest reported "best value" to date (Neil and I both
needed 0.6 once; the suggested 0.55 was optimal for my run -- which is why I
suggested it <wink>).

> (after looking at the histograms of a trial run).
>
> Net result:
>
> Graham: 8 fp's, 13 fn's.
> Gary:   7 fp's, 13 fn's.
>
> I ran the same two tests with a different random number:
>
> Graham: 8 fp's, 11 fn's.
> Gary:   6 fp's, 16 fn's.
>
> According to the histogram, Gary would have given 8 fp's and 10 fn's
> with a cutoff of 0.6, again beating Graham with the smallest margin.
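
The cutoff tradeoff described in the quote above can be sketched roughly as follows. The scores are invented for illustration and do not come from any run in this thread. `count_errors` is a hypothetical helper, and the rule that a message counts as spam when its score exceeds the cutoff is an assumption.

```python
def count_errors(ham_scores, spam_scores, cutoff):
    """Return (false positives, false negatives) at the given cutoff.

    Assumption: a message is called spam when its score is > cutoff.
    """
    fp = sum(1 for s in ham_scores if s > cutoff)    # ham called spam
    fn = sum(1 for s in spam_scores if s <= cutoff)  # spam called ham
    return fp, fn

# Hypothetical scores, invented for illustration only.
ham_scores = [0.10, 0.35, 0.58, 0.61, 0.70]
spam_scores = [0.52, 0.59, 0.62, 0.80, 0.95]

for cutoff in (0.55, 0.6, 0.625):
    fp, fn = count_errors(ham_scores, spam_scores, cutoff)
    print(f"cutoff={cutoff}: {fp} fp, {fn} fn")
```

Raising the cutoff trades false positives for false negatives, which is why the best value differs from one corpus to the next.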

Have you used cmp.py?  It generates side-by-side listings comparing two
runs, and it's extremely important to know how often each scheme beat the
other; the summary numbers reveal nothing about that.  For this particular
test, the tail end of the cmp.py output (with the changes in ham and spam
score means and sdevs) is useless.  When tweaking parameters for a single
scheme, though, all the cmp.py output is dripping with valuable clues.
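
A minimal sketch of the "how often each scheme beat the other" idea, using the two runs quoted above. This is not cmp.py's code or output format. `tally` is a hypothetical helper that counts, run by run, where the second scheme had fewer, the same, or more errors than the first.

```python
def tally(a_counts, b_counts):
    """Count runs where scheme B won, tied, or lost against scheme A.

    Fewer errors in a run counts as a win for that run.
    """
    pairs = list(zip(a_counts, b_counts))
    won = sum(1 for a, b in pairs if b < a)
    tied = sum(1 for a, b in pairs if b == a)
    lost = sum(1 for a, b in pairs if b > a)
    return won, tied, lost

# The two runs quoted above: Graham is scheme A, Gary is scheme B.
graham_fp, gary_fp = [8, 8], [7, 6]
graham_fn, gary_fn = [13, 11], [13, 16]

print("fp won/tied/lost:", tally(graham_fp, gary_fp))
print("fn won/tied/lost:", tally(graham_fn, gary_fn))
```

With only two runs the tally says little, but over many runs it shows whether a small edge in the totals is consistent or comes from a few lopsided runs.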

> I found one more spam in my ham (but didn't remove it between these
> runs).  I also found 10 empty messages in Bruce Guenter's spam
> archives!  (1 in 2002/01, 2 in 2002/05, 7 in 2002/06.)  Tim, I presume
> you cleaned these out long ago?

No, I've left empty messages in both my corpora, although I'm unclear on
what you mean by "empty".  I mean they have no body, but do have message
headers.  Sometimes a c.l.py ham consists solely of a question in the
Subject line!
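
The two senses of "empty" can be told apart with Python's standard `email` package, roughly as sketched below. `classify` is a hypothetical helper, the sample messages are made up, and treating a whitespace-only body as no body is an assumption.

```python
import email

def classify(raw_text):
    """Label a raw message as 'empty', 'headers only', or 'has body'."""
    msg = email.message_from_string(raw_text)
    if msg.is_multipart():
        # Multipart messages carry sub-parts; treat them as having a body.
        return "has body"
    payload = msg.get_payload() or ""
    if payload.strip():
        return "has body"
    # No body text: separate header-only messages from truly empty ones.
    return "headers only" if msg.keys() else "empty"

# Made-up sample messages for illustration.
print(classify(""))                                     # empty
print(classify("Subject: How do I sort a dict?\n\n"))   # headers only
print(classify("Subject: hi\n\nhello\n"))               # has body
```

A header-only message can still carry usable evidence in its Subject line, so it is not the same as a zero-byte file.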

> I'm going to clean these out, start over, and try to tweak Gary's
> parameters.

Bless you.