[Spambayes] Current histograms

Anthony Baxter anthony@interlink.com.au
Wed, 11 Sep 2002 14:23:38 +1000


> How many runs is this summarizing?  For each, how many ham&spam were in the
> training set?  How many in the prediction sets?  What were the error rates
> (run rates.py over your output file)?

5 sets, each of roughly 1800 ham/1550 spam. I just ran it the once (it
trained on each of the 5 sets and tested against the other 4...)

rates.py sez:

Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
      0.445   0.388
      0.445   0.323
      2.108   4.072
      0.556   1.097
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
      2.113   0.517
      1.335   0.194
      3.106   5.365
      2.113   2.903
Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
      2.447   0.646
      0.945   0.388
      2.884   3.426
      2.058   1.097
Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
      1.057   2.584
      0.723   1.682
      0.890   1.164
      0.445   0.452
Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
      0.779   4.328
      0.501   3.299
      0.667   3.361
      0.388   4.977
total false pos 273 3.03501945525
total false neg 367 4.74282760403
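
For anyone wanting to sanity-check the totals, here is a rough sketch that
re-derives them from the output pasted above. It is not rates.py itself. The
parsing is a guess based only on the layout shown here, and reading the two
columns as false positive % and false negative % is an assumption. The
function name summarize() is made up for illustration.

```python
import re

# The rates.py output from this message, pasted verbatim.
RATES_OUTPUT = """\
Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1548 spams
      0.445   0.388
      0.445   0.323
      2.108   4.072
      0.556   1.097
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1546 spams
      2.113   0.517
      1.335   0.194
      3.106   5.365
      2.113   2.903
Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1547 spams
      2.447   0.646
      0.945   0.388
      2.884   3.426
      2.058   1.097
Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1803 hams & 1547 spams
      1.057   2.584
      0.723   1.682
      0.890   1.164
      0.445   0.452
Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1550 spams
      0.779   4.328
      0.501   3.299
      0.667   3.361
      0.388   4.977
total false pos 273 3.03501945525
total false neg 367 4.74282760403
"""


def summarize(text):
    """Hypothetical helper: recompute summary figures from the pasted output."""
    hams = spams = 0
    fp_rates, fn_rates = [], []
    totals = {}
    for line in text.splitlines():
        m = re.match(r"Training on .* (\d+) hams & (\d+) spams", line)
        if m:
            # Add up the per-set message counts from the header lines.
            hams += int(m.group(1))
            spams += int(m.group(2))
            continue
        m = re.match(r"total false (pos|neg) (\d+) ", line)
        if m:
            totals[m.group(1)] = int(m.group(2))
            continue
        parts = line.split()
        if len(parts) == 2:
            # Assumed column order: false positive %, then false negative %.
            fp_rates.append(float(parts[0]))
            fn_rates.append(float(parts[1]))
    return {
        "hams": hams,
        "spams": spams,
        "runs": len(fp_rates),
        "mean_fp": sum(fp_rates) / len(fp_rates),
        "mean_fn": sum(fn_rates) / len(fn_rates),
        "total_fp_pct": 100.0 * totals["pos"] / hams,
        "total_fn_pct": 100.0 * totals["neg"] / spams,
    }


if __name__ == "__main__":
    for key, value in summarize(RATES_OUTPUT).items():
        print(key, value)
```

On these numbers the two "total" percentages come out as the counts divided
by the summed per-set sizes (8995 ham, 7738 spam), while the plain mean of
the 20 per-run rates is about 1.30 and 2.11. So the totals lines and the
per-run columns are evidently not measuring the same thing.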

> The effect of set sizes on accuracy rates isn't known.  I've informally
> reported some results from just a few controlled experiments on that.
> Jeremy reported improved accuracy by doubling the training set size, but
> that wasn't a controlled experiment (things besides just training set size
> changed between "before" and "after").

I'll try with 2 sets with half the messages each.

> Yup, tagging data is mondo tedious, and mistakes hurt.
> 
> I expect hammie will do a much better job on this already than hand
> grepping.  Be sure to stare at the false positives and get the spam out of
> there.

Yah, but there's a chicken-and-egg problem there - I want stuff that's
_known_ to be right to test this stuff, so using the spambayes code to
tell me whether it's spam is not going to help.


> With probabilities favoring ham or spam?  A skip token is produced in lieu
> of "word" more than 12 chars long and without any high-bit characters.  It's
> possible that they helped me because raw HTML produces lots of these.
> However, if you're running current CVS, Tokenizer/retain_pure_html_tags
> defaults to False now, so HTML decorations should vanish before body
> tokenization.

Yep, it shows up in a lot of spam, but also in different forms in hams. 
But the hams each manage to pick a different variant of 
~~~~~~~~~~~~~~~~~~~~~~
or whatever - so they don't end up counteracting the various bits in the
spam.
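
To make the point concrete, here is a minimal sketch of the skip rule as
described in the quoted text: a "word" longer than 12 chars with no high-bit
characters is replaced by a skip token. The exact token format used here
(first character plus length rounded down to a multiple of 10) is an
assumption and may not match what current CVS does. Dropping words under 3
chars, and emitting nothing for long high-bit words, are also just
simplifications for this sketch.

```python
def has_highbit_char(word):
    """True if any character falls outside 7-bit ASCII."""
    return any(ord(ch) >= 128 for ch in word)


def tokenize_word(word, maxword=12):
    """Sketch only: yield the word itself, or a skip token for long words."""
    n = len(word)
    if 3 <= n <= maxword:
        yield word
    elif n > maxword and not has_highbit_char(word):
        # Assumed token format: first char, plus length bucketed by tens.
        yield "skip:%s %d" % (word[0], n // 10 * 10)
    # Short words and long high-bit words yield nothing in this sketch.


if __name__ == "__main__":
    for sample in ["hello", "~" * 22, "=" * 22, "~" * 35]:
        print(repr(sample), list(tokenize_word(sample)))
```

Under that assumed format, a row of 22 tildes, a row of 22 equals signs and a
row of 35 tildes give three different tokens, so ham that uses one separator
style contributes nothing against a spam token built from another.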

Looking further, a _lot_ of the bad skip rubbish is coming from uuencoded
viruses &c in the spam-set.

Anthony