[spambayes-dev] Wittel/Wu article on statistical attacks

Fri Sep 17 03:40:19 CEST 2004

Results (using the script in my previous message, with only minor changes):

Using a corpus made up of ham from the SpamAssassin (SA) public archive and
spam from there and the SpamArchive.org collection (randomly selected), as
in the paper, trained on 3000+3000 messages with train-on-everything (toe),
and padding with the 10,000 common words referenced in the paper, I get
worse results:

(All SpamBayes defaults, basically current CVS code).

Base message scores: 0.927778857491
Words  Spam   Ham  Unsure
   10   142     0     858
   25    28    39     933
   50     7   319     674
  100     0   781     219
  200     0   986      14
  300     0   997       3
  400     0   975      25

However, the base message's score is nowhere near certain spam (it only
just clears the default 0.90 spam cutoff), so it's not particularly
surprising that adding random words drops the message into the unsures.
I'm not sure why the padded messages end up as ham rather than spam, though.
The lack of headers is significant, of course.
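
For anyone reading the tables without the earlier script to hand, here is a
minimal sketch of how each row could be produced. It is not the script from
my previous message: `bucket`, `tally`, and the `score` callable are
hypothetical names, the strict-inequality boundary handling and sampling
without replacement are assumptions, and the only values taken from
SpamBayes are the default cutoffs (ham 0.20, spam 0.90).

```python
import random

HAM_CUTOFF = 0.20   # SpamBayes default ham_cutoff
SPAM_CUTOFF = 0.90  # SpamBayes default spam_cutoff


def bucket(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a score in [0, 1] to 'spam', 'ham' or 'unsure'.

    Assumption: scores exactly on a cutoff count as unsure.
    """
    if score > spam_cutoff:
        return "spam"
    if score < ham_cutoff:
        return "ham"
    return "unsure"


def tally(base_tokens, common_words, n_words, score, trials=1000, seed=0):
    """Pad the base message with n_words random common words, `trials` times.

    `score` is a hypothetical callable that takes a list of tokens and
    returns a spam probability; plug in a real classifier here.
    Returns one table row as a dict of counts.
    """
    rng = random.Random(seed)
    counts = {"spam": 0, "ham": 0, "unsure": 0}
    for _ in range(trials):
        # Assumption: words are drawn without replacement.
        padding = rng.sample(common_words, n_words)
        counts[bucket(score(base_tokens + padding))] += 1
    return counts
```

With the default cutoffs, bucket(0.927778857491) is "spam", but only by a
margin of about 0.03, which is why a handful of hammy or neutral words is
enough to move the message.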

With a rough 'nonedge' training system, I get:

Base message scores: 0.898231181245
Words  Spam   Ham  Unsure
   10    16     0     984
   25     0    90     910
   50     0   721     279
  100     0   985      15
  200     0   999       1
  300     0   900     100
  400     0   600     400

And with a rough 'fpfnunsure' training system, I get:

Base message scores: 0.870322335645
Words  Spam   Ham  Unsure
   10    13     1     986
   25     0   486     514
   50     0   892     108
  100     0   998       2
  200     0  1000       0
  300     0   992       8
  400     0   625     375

At the largest word counts (300 and 400), both of these seem to be heading
back towards the message being unsure, rather than ham.

However, if I use a ham/spam corpus of my own (but the same base message),
then I get:

[toe]
Base message scores: 0.998165135488
Words  Spam   Ham  Unsure
   10   941     0      59
   25   792     0     208
   50   657     0     343
  100   505     0     495
  200   387     0     613
  300    97     0     903
  400   173     0     827

[nonedge]
Base message scores: 0.987644137822
Words  Spam   Ham  Unsure
   10   847     0     153
   25   751     0     249
   50   706     0     294
  100   719     0     281
  200   840     0     160
  300   904     0      96
  400   954     0      46

[fpfnunsure]
Base message scores: 0.999791130851
Words  Spam   Ham  Unsure
   10   981     0      19
   25   894     0     106
   50   817     0     183
  100   780     0     220
  200   823     0     177
  300   865     0     135
  400   918     0      82

These are probably more the results one would expect...
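
In case the regime names above are unfamiliar, here is a rough sketch of
the selection rules as I understand them. Treat all of it as an assumption
rather than the actual code used for these runs: `should_train` is a
hypothetical helper, 'nonedge' is taken to mean "train on anything not
already scored at the correct extreme (0.00 or 1.00 after rounding)", and
'fpfnunsure' is taken to mean "train only on false positives, false
negatives and unsures".

```python
def should_train(regime, score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Decide whether a just-scored message is added to the training set.

    regime  -- 'toe', 'nonedge' or 'fpfnunsure' (meanings assumed, see above)
    score   -- the classifier's spam probability for the message
    is_spam -- the message's true class
    """
    if regime == "toe":
        # Train on everything.
        return True
    if regime == "nonedge":
        # Skip only messages already scored at the correct extreme.
        correct_edge = 1.0 if is_spam else 0.0
        return round(score, 2) != correct_edge
    if regime == "fpfnunsure":
        # Skip messages that were classified correctly and confidently.
        correct = score > spam_cutoff if is_spam else score < ham_cutoff
        return not correct
    raise ValueError("unknown regime: %r" % (regime,))
```

Under these assumptions both non-toe regimes train on far fewer messages
than toe does, so the resulting databases differ a good deal in size.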

=Tony Meyer