[Spambayes] Outlook plugin - training
Rob Hooft
rob@hooft.net
Sat Nov 9 22:24:52 2002
Tim Peters wrote:
> [Tim]
>>I'm never going to get sub-0.1% error rates this way, but if this is the
>>best it ever got, I'd be quite happy with it for my personal email.
> BTW, I'm still doing this experiment, and my total training data is up to 45
> ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP
> and FN are both rare now, and the Unsure rate is about 5% overall and
> visibly falling.
I just added a test driver to CVS that simulates your behaviour as I
understand it: it trains on the first 30 messages, plus all
misclassified and all unsure messages. It is called "weaktest.py", and
uses the good old Data/{Sp|H}am hierarchy.
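The training policy described above can be sketched roughly as follows. This is not the code of weaktest.py; it is a minimal illustration under assumptions. The function name weak_train, the score/train callables, and the cutoff values 0.20 and 0.90 are all hypothetical choices made for the example, and counting the startup messages as unsure simply mirrors the summary line of the run.

```python
def weak_train(messages, score, train, startup=30,
               ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only on startup messages, mistakes, and unsures.

    messages: iterable of (msg, is_spam) pairs in arrival order.
    score(msg) -> spam probability in [0, 1] (hypothetical interface).
    train(msg, is_spam) updates the classifier (hypothetical interface).
    The cutoffs are assumed values, not taken from the post.
    """
    stats = {"seen": 0, "trained_ham": 0, "trained_spam": 0,
             "fp": 0, "fn": 0, "unsure": 0}

    def learn(msg, is_spam):
        train(msg, is_spam)
        stats["trained_spam" if is_spam else "trained_ham"] += 1

    for msg, is_spam in messages:
        stats["seen"] += 1
        if stats["seen"] <= startup:
            # Startup messages are always trained on and counted as unsure.
            stats["unsure"] += 1
            learn(msg, is_spam)
            continue
        p = score(msg)
        if ham_cutoff < p < spam_cutoff:
            stats["unsure"] += 1          # no confident verdict
        elif p >= spam_cutoff and not is_spam:
            stats["fp"] += 1              # ham scored as spam
        elif p <= ham_cutoff and is_spam:
            stats["fn"] += 1              # spam scored as ham
        else:
            continue                      # confident and correct: no training
        learn(msg, is_spam)
    return stats
```

Any classifier that can score a message and be trained incrementally fits this loop; a message that is classified confidently and correctly never reaches the training set.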
I think we should test its performance at different Options settings.
It may not even be very realistic to train on fp's, as I think in my
private E-mail I won't even check the spam folder very thoroughly at all.
Anyway, a default run for me now gives:
100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
fp: Data/Ham/Set2/n05250.txt score:0.9312
3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
fp: Data/Ham/Set1/n01590.txt score:0.9092
5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2 fn: 2
Total cost: $89.20
(This is on 3 out of my 10 test directories).
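The summary figures can be cross-checked with a few lines of arithmetic. The cost weights below are an assumption: the post does not state them, but $10 per fp, $1 per fn, and $0.20 per unsure reproduce the reported total exactly. The function name total_cost is hypothetical.

```python
def total_cost(fp, fn, unsure,
               fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are assumed, not quoted."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# Figures copied from the summary lines of the run.
cost = total_cost(fp=2, fn=2, unsure=336)
unsure_pct = 100.0 * 336 / 6540

print(f"Total cost: ${cost:.2f}")        # Total cost: $89.20
print(f"Unsure rate: {unsure_pct:.1f}%")  # Unsure rate: 5.1%
```

Under these weights the unsures account for $67.20 of the total and the two fp's for $20.00.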
Interesting to note so far:
* The "Total cost" is much higher than for train-on-all schemes,
but it is only due to Unsures; fp and fn are still small.
* The database growth does not level off over time; after the initial
phase it is roughly linear and can be described as:
nwords = 9200 + 1.6 * nmessages
or alternatively:
nwords = 5700 + 40 * ntrained
as can be seen in the attached PNGs.
* The training set is almost balanced, even though I scored
many more ham than spam
* The unsure rate drops over time:
0- 1000: 11.2% (minus 3.0% for the 30 startup messages, to be fair)
1000- 2000: 5.4%
2000- 3000: 4.3%
3000- 4000: 4.0%
4000- 5000: 3.9%
5000- 6000: 3.3%
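The per-interval unsure rates and the two linear fits can be recomputed from the progress lines of the run. The sketch below copies a handful of data points from that output; the function and variable names are hypothetical, and the comparison of the fits is only a rough sanity check, not a regression.

```python
# Cumulative unsure counts from the run, keyed by messages processed.
cumulative_unsure = {0: 0, 1000: 112, 2000: 166, 3000: 209,
                     4000: 249, 5000: 288, 6000: 321}

def interval_rates(cumulative):
    """Return (start, end, percent unsure) for each consecutive interval."""
    points = sorted(cumulative.items())
    return [(n0, n1, 100.0 * (u1 - u0) / (n1 - n0))
            for (n0, u0), (n1, u1) in zip(points, points[1:])]

for n0, n1, pct in interval_rates(cumulative_unsure):
    print(f"{n0:5d}-{n1:5d}: {pct:4.1f}%")

# (nmessages, ntrained, nwords) sampled from the same progress lines;
# ntrained is the sum of the trained ham and spam counts.
samples = [(1000, 112, 10001), (3000, 209, 14131),
           (5000, 291, 17257), (6500, 339, 19444)]

def fit_by_messages(nmessages):
    return 9200 + 1.6 * nmessages

def fit_by_trained(ntrained):
    return 5700 + 40 * ntrained

for nmessages, ntrained, nwords in samples:
    print(nmessages, nwords,
          round(fit_by_messages(nmessages)), round(fit_by_trained(ntrained)))
```

Both fits stay within a few hundred words of the observed database size for the later samples; the fit on nmessages is furthest off near the start of the run.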
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
Attachments:
words1.png (image/png, 12191 bytes)
http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words1-0001.png
words2.png (image/png, 12807 bytes)
http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words2-0001.png