[Spambayes] Current histograms
Anthony Baxter
anthony@interlink.com.au
Wed, 11 Sep 2002 14:23:38 +1000
> How many runs is this summarizing? For each, how many ham&spam were in the
> training set? How many in the prediction sets? What were the error rates
> (run rates.py over your output file)?
5 sets, each of roughly 1800 ham/1550 spam. I just ran it the once (it
tested all 5 against each other...)
rates.py sez:
Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
0.445 0.388
0.445 0.323
2.108 4.072
0.556 1.097
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
2.113 0.517
1.335 0.194
3.106 5.365
2.113 2.903
Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
2.447 0.646
0.945 0.388
2.884 3.426
2.058 1.097
Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
1.057 2.584
0.723 1.682
0.890 1.164
0.445 0.452
Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
0.779 4.328
0.501 3.299
0.667 3.361
0.388 4.977
total false pos 273 3.03501945525
total false neg 367 4.74282760403
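As a sanity check on the figures above, here is a minimal sketch (not part of rates.py) that re-reads the pasted output. It assumes the two per-run columns are false positive % then false negative %, and that the "total" percentages are the error counts divided by the combined ham and spam counts on the "Training on" lines. The second assumption does reproduce the printed percentages. The name summarize is made up for illustration.

```python
import re

# rates.py output exactly as pasted in the post.
OUTPUT = """\
Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
0.445 0.388
0.445 0.323
2.108 4.072
0.556 1.097
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
2.113 0.517
1.335 0.194
3.106 5.365
2.113 2.903
Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
2.447 0.646
0.945 0.388
2.884 3.426
2.058 1.097
Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
1.057 2.584
0.723 1.682
0.890 1.164
0.445 0.452
Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
0.779 4.328
0.501 3.299
0.667 3.361
0.388 4.977
total false pos 273 3.03501945525
total false neg 367 4.74282760403
"""


def summarize(text):
    """Hypothetical helper: summarize pasted rates.py output."""
    nham = nspam = 0
    runs = []      # (fp%, fn%) per test run; the column order is an assumption
    totals = {}    # 'pos' / 'neg' -> error count from the "total" lines
    for line in text.splitlines():
        m = re.match(r"Training on .* (\d+) hams & (\d+) spams", line)
        if m:
            nham += int(m.group(1))
            nspam += int(m.group(2))
            continue
        m = re.match(r"total false (pos|neg) (\d+) ", line)
        if m:
            totals[m.group(1)] = int(m.group(2))
            continue
        fields = line.split()
        if len(fields) == 2:
            runs.append((float(fields[0]), float(fields[1])))
    return {
        "n_runs": len(runs),
        "nham": nham,
        "nspam": nspam,
        "mean_fp": sum(fp for fp, _ in runs) / len(runs),
        "mean_fn": sum(fn for _, fn in runs) / len(runs),
        # Assumed definition: error count over all messages of that class.
        "total_fp_pct": 100.0 * totals["pos"] / nham,
        "total_fn_pct": 100.0 * totals["neg"] / nspam,
    }


summary = summarize(OUTPUT)
for key in sorted(summary):
    print(key, summary[key])
```

Under those assumptions the 20 runs average about 1.30% false positives and 2.11% false negatives, well below the "total" percentages, which count errors against each message once rather than per run.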
> The effect of set sizes on accuracy rates isn't known. I've informally
> reported some results from just a few controlled experiments on that.
> Jeremy reported improved accuracy by doubling the training set size, but
> that wasn't a controlled experiment (things besides just training set size
> changed between "before" and "after").
I'll try with 2 sets with half the messages each.
> Yup, tagging data is mondo tedious, and mistakes hurt.
>
> I expect hammie will do a much better job on this already than hand
> grepping. Be sure to stare at the false positives and get the spam out of
> there.
Yah, but there's a chicken-and-egg problem there - I want stuff that's
_known_ to be right to test this stuff, so using the spambayes code to
tell me whether it's spam is not going to help.
> With probabilities favoring ham or spam? A skip token is produced in lieu
> of a "word" more than 12 chars long and without any high-bit characters. It's
> possible that they helped me because raw HTML produces lots of these.
> However, if you're running current CVS, Tokenizer/retain_pure_html_tags
> defaults to False now, so HTML decorations should vanish before body
> tokenization.
Yep, it shows up in a lot of spam, but also in different forms in hams.
But the hams each manage to pick a different variant of
~~~~~~~~~~~~~~~~~~~~~~
or whatever - so they don't end up counteracting the various bits in the
spam.
Looking further, a _lot_ of the bad skip rubbish is coming from uuencoded
viruses etc. in the spam set.
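To make the skip-token point concrete, here is a rough sketch of the rule quoted above: a "word" longer than 12 characters with no high-bit characters is replaced by a skip token. The token format used here (first character plus the length rounded down to a multiple of 10) and the pass-through for high-bit words are assumptions for illustration only, and the function names are hypothetical rather than the tokenizer's real API.

```python
MAX_WORD_LEN = 12  # from the quoted description: longer words get skipped


def has_highbit_char(word):
    """True if any character falls outside 7-bit ASCII."""
    return any(ord(ch) >= 128 for ch in word)


def tokenize_word(word):
    """Hypothetical sketch: return the token(s) generated for one word."""
    n = len(word)
    if n <= MAX_WORD_LEN:
        return [word]
    if has_highbit_char(word):
        # The post does not say what happens to these; pass through
        # unchanged as a placeholder.
        return [word]
    # Assumed token format: first character and length bucketed by 10.
    return ["skip:%s %d" % (word[0], n // 10 * 10)]


# Decorative rules of different characters or lengths give distinct tokens.
for rule in ("~" * 22, "-" * 22, "~" * 35, "M" + "A" * 60):
    print(tokenize_word(rule))
```

With a format like this, separator lines built from different characters or of noticeably different lengths never share a token, so a handful of hams each using their own variant would not offset a skip token that recurs across many spams.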
Anthony