[Spambayes] Current histograms
Tim Peters
tim.one@comcast.net
Wed, 11 Sep 2002 20:13:06 -0400
[Anthony Baxter]
> 5 sets, each of 1800ham/1550spam, just ran the once (it matched all 5 to
> each other...)
>
> rates.py sez:
>
> Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
> 0.445 0.388
> 0.445 0.323
> 2.108 4.072
> 0.556 1.097
> Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
> 2.113 0.517
> 1.335 0.194
> 3.106 5.365
> 2.113 2.903
> Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
> 2.447 0.646
> 0.945 0.388
> 2.884 3.426
> 2.058 1.097
> Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
> 1.057 2.584
> 0.723 1.682
> 0.890 1.164
> 0.445 0.452
> Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
> 0.779 4.328
> 0.501 3.299
> 0.667 3.361
> 0.388 4.977
> total false pos 273 3.03501945525
> total false neg 367 4.74282760403
How were these msgs broken up into the 5 sets? Set4 in particular is giving
the other sets severe problems, and Set5 blows the f-n rate on everything
it's predicting -- when the rates across runs within a training set vary by
as much as a factor of 25, it suggests there was systematic bias in the way
the sets were chosen. For example, perhaps they were broken into sets by
arrival time. If that's what you did, you should go back and break them
into sets randomly instead. If you did partition them randomly, the wild
variance across runs is mondo mysterious.
>> I expect hammie will do a much better job on this already than hand
>> grepping. Be sure to stare at the false positives and get the
>> spam out of there.
> Yah, but there's a chicken-and-egg problem there - I want stuff that's
> _known_ to be right to test this stuff,
Then you have to look at every message by eyeball -- any scheme has non-zero
error rates of both kinds.
> so using the spambayes code to tell me whether it's spam is not
> going to help.
Trust me <wink> -- it helps a *lot*. I expect everyone who has done any
testing here has discovered spam in their ham, and vice versa. Results
improve as you improve the categorization. Once the gross mistakes are
straightened out, it's much less tedious to scan the rest by eyeball.
[on skip tokens]
> Yep, it shows up in a lot of spam, but also in different forms in hams.
> But the hams each manage to pick a different variant of
> ~~~~~~~~~~~~~~~~~~~~~~
> or whatever - so they don't end up counteracting the various bits in the
> spam.
>
> Looking further, a _lot_ of the bad skip rubbish is coming from
> uuencoded viruses &c in the spam-set.
For whatever reason, there appear to be few of those in BruceG's spam
collection. I added code to strip uuencoded sections, and pump out uuencode
summary tokens instead. I'll check it in. It didn't make a significant
difference on my usual test run (a single spam in my Set4 is now judged as
ham by the other 4 sets; nothing else changed). It does shrink the database
size here by a few percent. Let us know whether it helps you!
Before and after stripping uuencoded sections:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.050 tied
won 0 times
tied 20 times
lost 0 times
total unique fp went from 8 to 8 tied
false negative percentages
0.255 0.255 tied
0.364 0.364 tied
0.254 0.291 lost +14.57%
0.509 0.509 tied
0.436 0.436 tied
0.218 0.218 tied
0.182 0.218 lost +19.78%
0.582 0.582 tied
0.327 0.327 tied
0.255 0.255 tied
0.254 0.291 lost +14.57%
0.582 0.582 tied
0.545 0.545 tied
0.255 0.255 tied
0.291 0.291 tied
0.400 0.400 tied
0.291 0.291 tied
0.218 0.218 tied
0.218 0.218 tied
0.145 0.182 lost +25.52%
won 0 times
tied 16 times
lost 4 times
total unique fn went from 89 to 90 lost +1.12%