[Spambayes] Outlook plugin - training

Rob Hooft rob@hooft.net
Sat Nov 9 22:24:52 2002

---------------------- multipart/mixed attachment
Tim Peters wrote:
> [Tim]

>>I'm never going to get sub-0.1% error rates this way, but if this is the
>>best it ever got, I'd be quite happy with it for my personal email. 

> BTW, I'm still doing this experiment, and my total training data is up to 45
> ham and 38 spam, out of a total of about 1,700 msgs processed so far.  FP
> and FN are both rare now, and the Unsure rate is about 5% overall and
> visibly falling. 

I just added a test driver to CVS that simulates your behaviour as I 
understand it: it trains on the first 30 messages, plus all 
misclassified and all unsure messages. It is called "weaktest.py", and 
uses the good old Data/{Sp|H}am hierarchy.
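
The procedure described above can be sketched roughly as follows. This is not the code of weaktest.py, only an illustration of the idea: `classify` and `train` are hypothetical callables standing in for the real classifier, and the cutoffs of 0.20 and 0.90 are assumed values. Startup messages are counted as unsure here, in line with the "including 30 startup messages" wording in the run summary.

```python
def weak_train(messages, classify, train, startup=30,
               ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only on startup, misclassified and unsure messages.

    messages: iterable of (msg, is_spam) pairs in arrival order.
    classify: hypothetical callable returning a spam score in [0, 1].
    train:    hypothetical callable train(msg, is_spam).
    The cutoff defaults are assumptions made for this sketch.
    """
    counts = {"trained": 0, "fp": 0, "fn": 0, "unsure": 0}
    for index, (msg, is_spam) in enumerate(messages):
        if index < startup:
            # Startup messages are always trained on and counted as unsure.
            outcome = "unsure"
        else:
            score = classify(msg)
            if score >= spam_cutoff:
                outcome = "correct" if is_spam else "fp"
            elif score <= ham_cutoff:
                outcome = "fn" if is_spam else "correct"
            else:
                outcome = "unsure"
        if outcome != "correct":
            # Mistakes and unsures are the only messages fed back.
            counts[outcome] += 1
            train(msg, is_spam)
            counts["trained"] += 1
    return counts
```

With this bookkeeping the number of trained messages always equals fp + fn + unsure, which is the relation the summary at the end of a run should satisfy.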

I think we should test its performance at different Options settings.

It may not even be very realistic to train on fp's, as I think in my 
private E-mail I won't even check the spam folder very thoroughly at all.

Anyway, a default run for me now gives:

   100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
   200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
   300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
   400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
   500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
   600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
   700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
   800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
   900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
  1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
  1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
  1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
  1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
  1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
  1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
  1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
  1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
  1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
  1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
  2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
  2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
  2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
  2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
  2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
  2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
  2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
  2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
  2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
  2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
  3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
fp: Data/Ham/Set2/n05250.txt score:0.9312
  3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
  3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
  3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
  3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
  3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
  3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
  3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
  3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
  3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
  4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
  4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
  4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
  4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
  4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
  4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
  4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
  4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
  4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
  4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
fp: Data/Ham/Set1/n01590.txt score:0.9092
  5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
  5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
  5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
  5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
  5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
  5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
  5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
  5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
  5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
  5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
  6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
  6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
  6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
  6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
  6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
  6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2 fn: 2
Total cost: $89.20

(This is on 3 out of my 10 test directories).
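
The "Total cost" figure in the run above can be reproduced with a simple weighted sum. The weights used here ($10 per fp, $1 per fn, $0.20 per unsure) are an assumption; they are chosen because they give exactly the reported $89.20 for 2 fp, 2 fn and 336 unsure, not because the log states them.

```python
def total_cost(fp, fn, unsure,
               fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are assumed values."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# Counts taken from the summary lines of the run above.
print(f"Total cost: ${total_cost(fp=2, fn=2, unsure=336):.2f}")
```

Under these weights the unsures account for $67.20 of the total, so nearly all of the cost comes from the Unsure bucket rather than from fp or fn.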

Interesting to note so far:
  * The "Total cost" is much higher than for train-on-all schemes,
    but it is only due to Unsures; fp and fn are still small.
  * The database growth doesn't level off; after an initial phase it
    stays roughly linear and can be described as:
       nwords = 9200 + 1.6 * nmessages
    or alternatively:
       nwords = 5700 + 40 * ntrained
    as can be seen in the attached PNGs.
  * The training set is almost balanced, even though I scored
    many more ham than spam
  * The unsure rate drops over time:
         0- 1000: 11.2% (8.2% without the 30 startup messages)
      1000- 2000:  5.4%
      2000- 3000:  4.3%
      3000- 4000:  4.0%
      4000- 5000:  3.9%
      5000- 6000:  3.3%
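
The per-window unsure rates and the two growth formulas in the list above can be checked against the log with a few lines of Python. This is a minimal sketch: the rows are copied by hand from the 1000-message marks of the run, and the function names are made up for illustration.

```python
# Rows copied from the log: (messages, ham trained, spam trained, words, unsure)
LOG = [
    (1000, 67, 45, 10001, 112),
    (2000, 96, 70, 12637, 166),
    (3000, 123, 86, 14131, 209),
    (4000, 140, 111, 15951, 249),
    (5000, 151, 140, 17257, 288),
    (6000, 170, 154, 18764, 321),
]


def unsure_rates(log):
    """Percentage of unsure messages within each window between rows."""
    rates = []
    prev_n, prev_unsure = 0, 0
    for n, _ham, _spam, _words, unsure in log:
        rates.append(round(100.0 * (unsure - prev_unsure) / (n - prev_n), 1))
        prev_n, prev_unsure = n, unsure
    return rates


def words_from_messages(nmessages):
    """First description of database size from the list above."""
    return 9200 + 1.6 * nmessages


def words_from_trained(ntrained):
    """Second description of database size from the list above."""
    return 5700 + 40 * ntrained


print("unsure % per 1000 messages:", unsure_rates(LOG))
for n, ham, spam, words, _unsure in LOG:
    print(n, words, round(words_from_messages(n)), words_from_trained(ham + spam))
```

The loop prints the observed word count next to both predictions for each row, which makes it easy to see where the linear descriptions start to hold.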


Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words1.png
Type: image/png
Size: 12191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words1-0001.png

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words2.png
Type: image/png
Size: 12807 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words2-0001.png

---------------------- multipart/mixed attachment--
