[spambayes-dev] Wittel/Wu article on statistical attacks
Tony Meyer
tameyer at ihug.co.nz
Fri Sep 17 03:40:19 CEST 2004
Results (using the script from my previous message, with only minor
changes):
Using a corpus made up of ham from the SpamAssassin (SA) public archive
and spam from there and the SpamArchive.org collection (randomly
selected), as in the paper, with 3000 ham + 3000 spam trained 'toe'
(train on everything), and the 10,000 common words referenced in the
paper, I get worse results (all SpamBayes defaults, basically current
CVS code):
Base message scores: 0.927778857491
Words  Spam   Ham  Unsure
   10   142     0     858
   25    28    39     933
   50     7   319     674
  100     0   781     219
  200     0   986      14
  300     0   997       3
  400     0   975      25
However, the base message's score is nowhere near certain spam, so it's not
particularly surprising that adding random words drops the message into the
unsures. I'm not sure why they end up ham rather than spam, though.
Lacking headers is significant, of course.
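
For readers without the earlier script to hand, the tally behind each
table row can be sketched roughly as below. This is not the script from
the previous message. The names classify, tally_attack and toy_score are
hypothetical, the scorer is a stand-in for a trained classifier, and
sampling the padding words without replacement is an assumption. The
cutoffs are the SpamBayes defaults (ham_cutoff 0.20, spam_cutoff 0.90);
how a score exactly on a cutoff is treated is also an assumption. Each
row above sums to 1000, so 1000 trials per word count appears to have
been used.

```python
import random
from collections import Counter

HAM_CUTOFF = 0.20   # SpamBayes default ham_cutoff
SPAM_CUTOFF = 0.90  # SpamBayes default spam_cutoff


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a score in [0, 1] to 'ham', 'spam' or 'unsure'.

    Boundary handling (strict comparisons) is an assumption.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


def tally_attack(base_tokens, common_words, n_words, score,
                 trials=1000, rng=None):
    """Pad the base message with n_words random common words, score it,
    and count how often each classification comes up over all trials.

    `score` is any callable taking a token list and returning a float;
    here it stands in for a trained classifier.
    """
    rng = rng or random.Random()
    counts = Counter()
    for _ in range(trials):
        # Assumption: padding words are drawn without replacement.
        padding = rng.sample(common_words, n_words)
        counts[classify(score(base_tokens + padding))] += 1
    return counts


def toy_score(tokens):
    """Stand-in scorer for illustration only: fraction of 'spammy' tokens."""
    spammy = {"viagra", "free", "offer"}
    return sum(t in spammy for t in tokens) / len(tokens)


if __name__ == "__main__":
    words = ["word%d" % i for i in range(10000)]
    base = ["viagra", "free", "offer"]
    for n in (10, 25, 50):
        c = tally_attack(base, words, n, toy_score, rng=random.Random(0))
        print(n, c["spam"], c["ham"], c["unsure"])
```

With these cutoffs the first base score above (0.9277...) is only just
over the spam threshold, which fits the remark that it is nowhere near
certain spam.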
With a rough 'nonedge' training system, I get:
Base message scores: 0.898231181245
Words  Spam   Ham  Unsure
   10    16     0     984
   25     0    90     910
   50     0   721     279
  100     0   985      15
  200     0   999       1
  300     0   900     100
  400     0   600     400
And with a rough 'fpfnunsure' training system, I get:
Base message scores: 0.870322335645
Words  Spam   Ham  Unsure
   10    13     1     986
   25     0   486     514
   50     0   892     108
  100     0   998       2
  200     0  1000       0
  300     0   992       8
  400     0   625     375
With both of these, the message seems to be heading back towards unsure,
rather than ham, once 300 or 400 words are added.
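
As a reminder of what the three training regimes mean, here is a minimal
sketch of the per-message training decision. It assumes that 'toe'
trains on everything, that 'fpfnunsure' trains only on false positives,
false negatives and unsures, and that 'nonedge' trains on anything not
already scored at the correct extreme. The function name should_train
and the edge width of 0.005 are hypothetical, and the "rough" versions
used for the tables may differ in detail.

```python
def should_train(regime, is_spam, score,
                 ham_cutoff=0.20, spam_cutoff=0.90, edge=0.005):
    """Decide whether to train on a message, given its true class and
    the score the classifier gave it before training.

    The regime definitions and the edge width are assumptions.
    """
    if regime == "toe":
        # Train on everything, regardless of score.
        return True
    if regime == "fpfnunsure":
        # Train only when the message was misclassified or left unsure.
        correct = score > spam_cutoff if is_spam else score < ham_cutoff
        return not correct
    if regime == "nonedge":
        # Train unless the score already sits at the correct extreme.
        at_edge = score >= 1.0 - edge if is_spam else score <= edge
        return not at_edge
    raise ValueError("unknown regime: %r" % (regime,))
```

For example, a spam that scored 0.95 would be skipped under
'fpfnunsure' (already classified correctly) but still trained under
'nonedge' (not yet at the extreme).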
However, if I use a ham/spam corpus of my own (but the same base message),
then I get:
[toe]
Base message scores: 0.998165135488
Words  Spam   Ham  Unsure
   10   941     0      59
   25   792     0     208
   50   657     0     343
  100   505     0     495
  200   387     0     613
  300    97     0     903
  400   173     0     827
[nonedge]
Base message scores: 0.987644137822
Words  Spam   Ham  Unsure
   10   847     0     153
   25   751     0     249
   50   706     0     294
  100   719     0     281
  200   840     0     160
  300   904     0      96
  400   954     0      46
[fpfnunsure]
Base message scores: 0.999791130851
Words  Spam   Ham  Unsure
   10   981     0      19
   25   894     0     106
   50   817     0     183
  100   780     0     220
  200   823     0     177
  300   865     0     135
  400   918     0      82
These are probably more the results one would expect...
=Tony Meyer