[Spambayes] Current histograms
Wed, 11 Sep 2002 14:23:38 +1000
> How many runs is this summarizing? For each, how many ham&spam were in the
> training set? How many in the prediction sets? What were the error rates
> (run rates.py over your output file)?
5 sets, each of roughly 1800 ham/1550 spam; I just ran it the once (it matched all 5 to
Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
total false pos 273 3.03501945525
total false neg 367 4.74282760403
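
As a sanity check, the two percentages above can be reproduced from the per-set counts. This is only a sketch, assuming each rate is the total error count divided by the total number of hams (or spams) summed over the five sets. The variable names are illustrative and are not taken from rates.py.

```python
# Per-set message counts, copied from the "Training on ..." lines above.
hams = [1798, 1798, 1798, 1803, 1798]
spams = [1548, 1546, 1547, 1547, 1550]

false_pos = 273  # hams scored as spam
false_neg = 367  # spams scored as ham

# Assumption: rate = errors / all messages of that class, as a percentage.
fp_rate = 100.0 * false_pos / sum(hams)
fn_rate = 100.0 * false_neg / sum(spams)

print("total false pos", false_pos, fp_rate)
print("total false neg", false_neg, fn_rate)
```

Under that assumption the denominators are 8995 hams and 7738 spams, which is consistent with the figures reported above.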
> The effect of set sizes on accuracy rates isn't known. I've informally
> reported some results from just a few controlled experiments on that.
> Jeremy reported improved accuracy by doubling the training set size, but
> that wasn't a controlled experiment (things besides just training set size
> changed between "before" and "after").
I'll try with 2 sets with half the messages each.
> Yup, tagging data is mondo tedious, and mistakes hurt.
> I expect hammie will do a much better job on this already than hand
> grepping. Be sure to stare at the false positives and get the spam out of
Yah, but there's a chicken-and-egg problem - I want stuff that's
_known_ to be right to test this, so using the spambayes code to
tell me whether it's spam is not going to help.
> With probabilities favoring ham or spam? A skip token is produced in lieu
> of any "word" more than 12 chars long and without any high-bit characters. It's
> possible that they helped me because raw HTML produces lots of these.
> However, if you're running current CVS, Tokenizer/retain_pure_html_tags
> defaults to False now, so HTML decorations should vanish before body
Yep, it shows up in a lot of spam, but also in different forms in hams.
But the hams each manage to pick a different variant of
or whatever - so they don't end up counteracting the various bits in the
Looking further, a _lot_ of the bad skip rubbish is coming from uuencoded
viruses etc. in the spam set.
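
To illustrate why uuencoded bodies generate so many skip tokens, here is a minimal sketch of the rule quoted earlier (a "word" longer than 12 chars with no high-bit characters becomes a skip token). It is not the actual spambayes tokenizer: the function names, the token format (first character plus length rounded down to a multiple of 10) and the pass-through of words containing high-bit characters are all assumptions made for the example.

```python
def tokenize_word(word, max_len=12):
    """Return a token for one whitespace-delimited word."""
    has_high_bit = any(ord(ch) >= 128 for ch in word)
    if len(word) > max_len and not has_high_bit:
        # Hypothetical token format: first char + length bucket.
        return "skip:%s %d" % (word[0], len(word) // 10 * 10)
    # Assumption: everything else is passed through unchanged.
    return word


def tokenize(text):
    """Split on whitespace and tokenize each word."""
    return [tokenize_word(w) for w in text.split()]


# A made-up line shaped like uuencoded data: "M" followed by 60
# printable characters and no whitespace, so it is one long "word".
uu_line = "M" + "5F%T:&5R(&-O;G1E;G0@" * 3

print(tokenize("short words stay"))
print(tokenize(uu_line))
```

Every full uuencoded line collapses to the same kind of token under this rule, so a set with many uuencoded attachments would pile up skip tokens quickly.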