[Spambayes] Outlook plugin - training

Tim Peters tim.one@comcast.net
Sun Nov 10 07:52:42 2002


[Rob Hooft]
> I just added a testdriver to CVS that simulates your behaviour as I
> understand it: It will train on the first 30 messages,

I trained on 1 of each at the start.  If I were to do it over, I'd start
with an empty database <wink>.

> plus on all misclassified and all unsure messages.

Since I'm doing this real-time on my live email, I've been training "on the
worst" (farthest away from correct) msg that arrives in a batch, then
rescoring all the ones that arrived in the batch, then training the worst
remaining, ... until all new ham is below ham_cutoff and all new spam above
spam_cutoff.  I don't know that it matters, just being clear(er).  As things
turned out, this worst-at-a-time training never managed to push one of the
remaining mistakes/unsures into the correct category, *except* for cases
where I got more than one copy of a spam from different accounts at the same
time.  Then it always pushed the copies into scoring near 1.0, since the
hapaxes in the training copy are abundant.
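
In case that procedure is still murky, here's a rough sketch of the loop in
Python.  It's not the plugin's actual code -- the spamprob()/learn() calls and
the cutoff values are just my shorthand for the usual classifier methods and
Options settings:

    # A rough sketch of training "on the worst" -- not the plugin's actual code.
    # Assumes a spambayes-style classifier with spamprob(tokens) and
    # learn(tokens, is_spam), plus ham_cutoff/spam_cutoff values like Options'.

    def distance_from_correct(score, is_spam, ham_cutoff, spam_cutoff):
        """How far a score is from landing in the correct region (0.0 if correct)."""
        if is_spam:
            return max(0.0, spam_cutoff - score)   # spam should score above spam_cutoff
        return max(0.0, score - ham_cutoff)        # ham should score below ham_cutoff

    def train_worst_at_a_time(classifier, batch, ham_cutoff=0.20, spam_cutoff=0.90):
        """batch: list of (tokens, is_spam) pairs for a newly arrived, hand-judged batch."""
        remaining = list(batch)
        while remaining:
            # Rescore everything not yet trained on, and find the worst offender.
            scored = [(distance_from_correct(classifier.spamprob(tokens), is_spam,
                                             ham_cutoff, spam_cutoff), i)
                      for i, (tokens, is_spam) in enumerate(remaining)]
            worst_distance, worst_i = max(scored)
            if worst_distance == 0.0:
                break   # all new ham below ham_cutoff, all new spam above spam_cutoff
            tokens, is_spam = remaining.pop(worst_i)
            classifier.learn(tokens, is_spam)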

> It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am
> hierarchy.
>
> I think we should test its performance at different Options settings.
>
> It may not even be very realistic to train on fp's, as I think for my
> private E-mail I won't even check the spam folder very thoroughly at all.

But I will (and do), and my primary interest here is to see how bad things
can get if a user takes mistake-based training to an extreme.  Although it's
heavily hapax-driven, it appears to do very well when judged by error
rate.

I've been doing it long enough now, though, that it doesn't do so well
subjectively:  the Unsures are too often bizarre.  For example, I sent a
long reply here to Robert Woodland, and the copy I got back showed up as
Unsure, with H=1 and S=0.66.  There were a lot of accidental spam hapaxes in
that msg!  Training on it as ham then eliminated about 30 spam hapaxes
(they're now neutral, having been seen in one ham and one spam each).
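
To see why training on one ham can neutralize that many spam hapaxes at once,
here's the per-word arithmetic in a toy form.  It assumes Robinson's s/x
adjustment with what I believe are the default constants, and uses counts in
the ballpark of my 47-ham/41-spam database mentioned below -- illustrative
only, not the classifier's code:

    # Illustrative arithmetic only -- a toy version of the per-word probability,
    # assuming Robinson's s/x adjustment with the defaults S=0.45, X=0.5.
    S, X = 0.45, 0.5

    def word_prob(spam_count, ham_count, nspam, nham):
        spam_ratio = spam_count / float(nspam)
        ham_ratio = ham_count / float(nham)
        p = spam_ratio / (spam_ratio + ham_ratio)    # raw spamminess of the word
        n = spam_count + ham_count                   # how much evidence we have
        return (S * X + n * p) / (S + n)             # pull toward X when n is small

    # A spam hapax in a small database (roughly 41 spam, 47 ham trained):
    print(word_prob(1, 0, 41, 47))   # ~0.84 -- fairly strong spam evidence
    # After training the "surprising" ham, the same word is 1-and-1:
    print(word_prob(1, 1, 41, 47))   # ~0.53 -- essentially neutral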

So it's no different from my POV than the cases where people have sent me
"surprising msgs" in the past, and my carefully trained slice-of-life
classifier (regularly trained on a sampling of correctly classified msgs
too) at the time had no trouble nailing them as ham or spam, with lots of
non-hapax evidence to back it up.

IOW, I'm still sticking to what I guessed before I started this:
mistake-driven training will appear to work well over the short term, but
it's brittle, and is brittle because of its reliance on hapaxes.

> Anyway, a default run for me now gives:
>
>    100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
>    200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
>    300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
>    400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
>    500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
>    600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
>    700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
>    800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
>    900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
>   1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
>   1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
>   1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
>   1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
>   1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
>   1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
>   1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
>   1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
>   1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
>   1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
>   2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
>   2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
>   2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
>   2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
>   2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
>   2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
>   2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
>   2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
>   2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
>   2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
>   3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
> fp: Data/Ham/Set2/n05250.txt score:0.9312
>   3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
>   3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
>   3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
>   3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
>   3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
>   3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
>   3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
>   3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
>   3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
>   4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
>   4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
>   4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
>   4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
>   4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
>   4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
>   4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
>   4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
>   4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
>   4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
> fp: Data/Ham/Set1/n01590.txt score:0.9092
>   5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
>   5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
>   5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
>   5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
>   5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
>   5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
>   5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
>   5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
>   5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
>   5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
>   6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
>   6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
>   6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
>   6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
>   6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
>   6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 336 (5.1%)
> Trained on 178 ham and 162 spam
> fp: 2 fn: 2
> Total cost: $89.20
>
> (This is on 3 out of my 10 test directories).
>
> Interesting to note so far:
>   * The "Total cost" is much higher than for train-on-all schemes,
>     but it is only due to Unsures; fp and fn are still small.

That matches my experience too, although I started with 1 ham and 1 spam and
had high FP and FN rates over the first few hours.
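
For anyone puzzled by the dollar figure above:  it's consistent with the test
driver's usual accounting of $10 per fp, $1 per fn and $0.20 per unsure --
treat those weights as my recollection of the defaults rather than gospel:

    # Sanity check on the quoted "Total cost", assuming the usual cost weights
    # ($10 per fp, $1 per fn, $0.20 per unsure) -- weights recalled from memory.
    fp, fn, unsure = 2, 2, 336
    print(fp * 10.0 + fn * 1.0 + unsure * 0.20)   # 89.2, matching the $89.20 above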

>   * The database growth doesn't decay with time after a while;
>     it can be described as:
>        nwords = 9200 + 1.6 * nmessages
>     or alternatively:
>        nwords = 5700 + 40 * ntrained
>     ..as can be seen in the attached png's

I expect that's mostly because there are still (relatively) few total msgs
trained on.
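
For anyone who wants to eyeball the quoted fits, here's a quick check against
a few data points from the run above (purely illustrative):

    # Check the quoted linear fits against a few (nmessages, ntrained, nwords)
    # points pulled from the run above.
    points = [(3000, 123 + 86, 14131), (5000, 151 + 140, 17257), (6500, 178 + 161, 19444)]
    for nmessages, ntrained, nwords in points:
        by_messages = 9200 + 1.6 * nmessages
        by_trained = 5700 + 40 * ntrained
        print(nmessages, nwords, int(by_messages), int(by_trained))
    # -> both fits land within a few hundred words of the actual counts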

>   * The training set is almost balanced, even though I scored
>     many more ham than spam

Curiously, same here!  I get about 500 ham and 100 spam per day, but my
training database now has 47 ham and 41 spam.  It does well, except when it
sucks <wink>.

>   * The unsure rate drops over time:

I haven't measured that, but it's clearly been so here too (as I said
before).

>          0- 1000: 11.2% (minus 3.0% to be fair)
>       1000- 2000:  5.4%
>       2000- 3000:  4.3%
>       3000- 4000:  4.0%
>       4000- 5000:  3.9%
>       5000- 6000:  3.3%

Proving what I've always suspected:  over time, all msgs are repetitions of
ones you've seen before <0.9 wink>.



