RE: [Spambayes] Outlook plugin - training
I don't believe you need this. I think that the classifier automatically trains on messages as they arrive (or at least on messages that it's sure about). You only need to retrain if it has made a mistake, or if it's unsure. Piers.
-----Original Message-----
From: Moore, Paul [mailto:Paul.Moore@atosorigin.com]
Sent: Wednesday, November 06, 2002 2:09 AM
To: Spambayes (E-mail)
Subject: [Spambayes] Outlook plugin - training
When the Outlook plugin filters mails, it classifies them as either spam or potential spam, and can put them in appropriate folders.
In the spam/potential spam folders, there is a "Recover from Spam" button available, and in other folders there is a "Delete as spam" button. These buttons add the message to the training database as well as taking the appropriate action.
One thing I don't see, however, is a means of confirming the classifier's decisions as correct: a "yes, that is spam" button for the spam folder, and a "yes, that's ham" button in non-spam folders.
As I'm starting from a very small message base, I worry that correct classifications are still somewhat based on "luck", and training based on correct decisions would help to increase both my and the classifier's confidence level.
I can do this by regular retraining, but that has 2 disadvantages: it's much clumsier than simply clicking on a "clever boy!" button, and it relies on me not deleting messages until I do a training run. Much of the ham I get is "read and forget", so I'd rather delete immediately.
When I get a chance to dive into the code, I'll see how hard this would be to implement.
Paul.
_______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes
[Piers responding to Paul]
I don't believe you need this. I think that the classifier automatically trains on messages as they arrive (or at least on messages that it's sure about). You only need to retrain if it has made a mistake, or if it's unsure.
As Tim says, we really only do "mistake" training - nothing is trained as it comes in, only scored. Manually moving messages (via the button or d&d) is the only thing that triggers an incremental re-train. The key limitation of this scheme, as Tim also alludes to, is that this never correctly classifies ham.

However, I actually see this incremental training more as a "get smarter now" than a "just get smarter" technique - ie, a user sees a mis-classified Spam, and by re-training they are increasing the chances that the next similar mail will be handled correctly. Instant feedback, especially while the user is getting started. ie, it is indeed "mistake based training", but that may still prove useful in addition to ongoing training.

I can't help thinking that we are somehow underestimating our own tool here. As is common when people first use this tool, spam is generally found in the ham set and vice-versa. Because of this, I know that my Inbox is spam free (but I'm less sure about my other "ham" folders). I'm also sure that my Spam folder has no ham. This should remain true while I continue to use the tool. So surely we can exploit this somehow. Off the top of my head:

* Assume we don't trust the last 2 days of mail (as the user may not yet have sorted them). Anything in the "good" and "spam" folders older than this can be assumed correctly classified, and able to be trained on.

* A process could go through all ham and spam trained on, and score each message. Any "suspect" messages are presented in a list (much like the Outlook "Find Message" result list). The user can indicate that the message is correct (and the system will remember, never asking about this message again) or is indeed incorrectly classified. If incorrect, it will be moved, and incrementally trained as per now. (I can also picture a whitelist kicking in here: if incorrect, offer to add the sender to the whitelist. If the sender is in the whitelist, assume ham - meaning mail from this person can never again be classified as spam.)

I can picture this working in the background, and simply indicating to the user that there are "conflicts" to be resolved at their leisure. Further, I imagine that as we build better training data for each message store, the number of "conflicts" actually found would generally be zero - ie, the system would find that all 2-day-and-older mail correctly classifies. While the above is more a brain-fart than a reasoned design, I agree that staying out of your face is important for widespread use.

Mark.
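[Editorial note: Mark's background "conflict" scan could be sketched along these lines. Everything here - the find_conflicts helper, the message tuples, the score callable - is a hypothetical stand-in for illustration, not the Outlook plugin's actual API.]

```python
# Sketch of Mark's background "conflict" scan (hypothetical API names).
# Walk messages older than the distrust window, rescore each, and collect
# those whose score disagrees with the folder they live in.

from datetime import datetime, timedelta

DISTRUST_WINDOW = timedelta(days=2)   # "don't trust the last 2 days of mail"
HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90  # assumed cutoffs from the discussion

def find_conflicts(messages, score, confirmed_ok):
    """messages: iterable of (msg_id, folder, received, text) tuples.
    score: callable returning a spam probability in [0, 1].
    confirmed_ok: msg_ids the user has already marked as correct."""
    cutoff_time = datetime.now() - DISTRUST_WINDOW
    conflicts = []
    for msg_id, folder, received, text in messages:
        if received > cutoff_time or msg_id in confirmed_ok:
            continue  # too recent, or the user already vetted it
        prob = score(text)
        if folder == "ham" and prob > HAM_CUTOFF:
            conflicts.append((msg_id, folder, prob))   # suspect ham
        elif folder == "spam" and prob < SPAM_CUTOFF:
            conflicts.append((msg_id, folder, prob))   # suspect spam
    return conflicts
```

The "remember, never ask again" behaviour is just the `confirmed_ok` set; a whitelist check would slot in before the scoring step.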
"Mark Hammond" <mhammond@skippinet.com.au> writes:
ie, it is indeed "mistake based training", but that may still prove useful in addition to ongoing training.
From a newcomer's point of view, I think a key point is that "mistake based training" is easy to understand.
I also believe that "confirmation based training" (my "clever boy!" button for specifically affirming that the classifier's magic gave the right answer) is easy to understand. More than that, a new user *expects* to need to do something like this, as the initial impression is one of amazement at the accuracy of the classifier. But such a gadget will fall into disuse as the user starts to expect the classifier to be right - so it probably doesn't have enough long-term value to be worth providing.

Batch training (keeping ham and spam, and pumping it into the classifier in a regular training run) feels highly unnatural. My instinct is to *delete* spam - keeping it feels wrong.
I can't help thinking that we are somehow underestimating our own tool here.
Coming at it from cold, I can confirm that the effect feels like pure magic. I trained on what I thought was a uselessly small corpus (I had *no* historical spam, so I retrieved the day's batch from the wastebin and used that). The results have been so good that I can already, 2 days later, feel myself tending to "trust" the classifier, and forgetting about training issues.

But unlike Mark, my instinct is that this is not such a good thing (solely from a training point of view). If people get such good results on inadequate training, they won't work at it enough, so the need is to make good training so easy and automatic that the tendency to forget to bother is offset.

It's too late to think this through right now. I'll ponder some more in the morning...

Paul.
-- 
This signature intentionally left blank
[Mark Hammond]
... The key limitation of this scheme, as Tim also alludes to, is that this never correctly classifies ham. However, I actually see this incremental training more as a "get smarter now" than a "just get smarter" technique - ie, a user sees a mis-classified Spam, and by re-training they are increasing the chances that the next similar mail will be handled correctly. Instant feedback, especially while the user is getting started.
ie, it is indeed "mistake based training", but that may still prove useful in addition to ongoing training.
I sure agree it's *very* useful at the start, and expect it will continue to be useful over time.
I can't help thinking that we are somehow underestimating our own tool here.
I'm going to try an experiment: I'm going to wipe my home database and start over from scratch, training first on one ham and one spam, then only on mistakes and unsures. This should be fun <wink>.
As is common when people first use this tool, spam is generally found in the ham set and vice-versa. Because of this, I know that my Inbox is spam free (but I'm less sure about my other "ham" folders). I'm also sure that my Spam folder has no ham. This should remain true while I continue to use the tool.
How do you know your Spam folder has no ham? I know mine doesn't because I routinely score it, sort on the score, and stare at "the wrong end". I find ham there as often as not, *usually* apparently due to mousing error when dragging a training ham into the Ham folder and overshooting the mark.
So surely we can exploit this somehow. Off the top of my head:

* Assume we don't trust the last 2 days of mail (as the user may not yet have sorted them). Anything in the "good" and "spam" folders older than this can be assumed correctly classified, and able to be trained on.
Provided the user has already done a decent amount of training, then as Paul Moore suggested it could even work to trust ham-vs-spam decisions immediately, and let user corrections undo those as needed. A well-trained system should be pretty robust against a few misclassifications over the short term.
* A process could go through all ham and spam trained on, and score each message. Any "suspect" messages are presented in a list (much like the Outlook "Find Message" result list). The user can indicate that the message is correct (and the system will remember, never asking about this message again) or is indeed incorrectly classified. If incorrect, it will be moved, and incrementally trained as per now. (I can also picture a whitelist kicking in here: if incorrect, offer to add the sender to the whitelist. If the sender is in the whitelist, assume ham - meaning mail from this person can never again be classified as spam.)
Tell us about the mistakes *you* see. I feel like we're designing a solution to a hypothetical problem otherwise. The only "mistake" I routinely see is that my cigarettes-via-web advertising keeps getting knocked back into Unsure territory. That doesn't bother me enough to do anything about it, but if it bothers you enough <wink> then, yes, a whitelist would solve that one.
I can picture this working in the background, and simply indicating to the user that there are "conflicts" to be resolved at their leisure.
Or maybe we could just move those back to the Unsure folder. The user should already know what to do about things in Unsure, so it's nothing new to them. Moving a msg out of Unsure could be taken as a positive sign that the user has classified such a msg once and for all (well, until they move it again, anyway).
Further, I imagine that as we build better training data for each message store, the number of "conflicts" actually found would generally be zero - ie, the system would find that all 2 day and older mail correctly classifies.
I expect that's true.
[Tim]
... I'm going to try an experiment: I'm going to wipe my home database and start over from scratch, training first on one ham and one spam, then only on mistakes and unsures. This should be fun <wink>.
It is! The msg from me I'm replying to here scored 94 (solid spam). I've now got 5 ham and 5 spam in my training set, most of the new ones from Unsures. The latest spam was a blatant false negative, from Hapax City:

    '*H*' 0.998601   '*S*' 8.60833e-005
    'can' 0.0652174   'have' 0.0652174
    "don't" 0.0918367   'never' 0.0918367   'number' 0.0918367   'one' 0.0918367   'what' 0.0918367
    '"the' 0.155172
    [ham hapaxes from here]
    'able' 0.155172   'about' 0.155172   'against' 0.155172   'also' 0.155172   'any' 0.155172
    'anything' 0.155172   'back' 0.155172   'because' 0.155172   'been' 0.155172   'check' 0.155172
    'even' 0.155172   'find' 0.155172   'found' 0.155172   'heard' 0.155172   'how' 0.155172
    'into' 0.155172   "it's" 0.155172   'more' 0.155172   'needed' 0.155172   'other' 0.155172
    'out' 0.155172   'own' 0.155172   'people' 0.155172   'skip:a 10' 0.155172   'skip:i 10' 0.155172
    'special' 0.155172   'subject:.' 0.155172   'subject:: ' 0.155172   'their' 0.155172   'them.' 0.155172
    'they' 0.155172   'those' 0.155172   'time' 0.155172   'time.' 0.155172   'unsubscribe' 0.155172
    'until' 0.155172   'useful' 0.155172   'using' 0.155172
    [to here]
    'and' 0.275281   'for' 0.275281   'subject: ' 0.275281   'you' 0.275281
    'from' 0.355072   'not' 0.355072   'off' 0.355072   'our' 0.355072   'when' 0.355072
    'new' 0.644928   'see' 0.644928
    'url:gif' 0.724719   'url:www' 0.724719
    'call' 0.844828
    [spam hapaxes from here]
    'contact' 0.844828   'credit' 0.844828   'email.' 0.844828   'every' 0.844828   'further' 0.844828
    'header:Received:2' 0.844828   'made' 0.844828   'more!' 0.844828   'most' 0.844828   'now' 0.844828
    'plus,' 0.844828   'receive' 0.844828   'search' 0.844828   'skip:1 10' 0.844828   'url:jpg' 0.844828
    [to here]
    'email' 0.908163

I think I've established that 5+5 isn't enough for great results <snort>. However, 80% of its decisions have been correct so far!
[Tim]
... I'm going to try an experiment: I'm going to wipe my home database and start over from scratch, training first on one ham and one spam, then only on mistakes and unsures. This should be fun <wink>. ...
After enduring the first round of gross mistakes, when I got up today I did this:

    while some ham in my inbox scores above 0.20 (my ham_cutoff):
        pick the highest-scoring ham in the inbox
        add it to the ham training set
        train on it
        rescore the inbox

These are false positives and unsures the classifier would have had if these msgs had come in after I started the experiment. There were about 700 msgs in the inbox.

Other than that, I've left it mistake-driven and unsure-driven on live incoming email. Spam that's correctly classified simply gets deleted (no training on it), ditto ham. It's been a light spam day, but hundreds of msgs have come in since then and I haven't seen a mistake or unsure in about 5 hours, although plenty of ham gets near ham_cutoff and plenty of spam near spam_cutoff. Total training data now: just 45 ham and 20 spam.

Scores remain grossly hapax-driven, but that's actually enough to classify most of my email correctly: a small number of subjects and senders and mailing lists overwhelmingly dominate my ham mix, and one email account accounts for the vast bulk of my spam. Removing the hapaxes from the database dropped the # of words from 5500 to about 1700. Rescoring the inbox with this reduced database then pushed about 5% of the msgs back into Unsure. So (no surprise here) hapaxes are vital with little training data. That also means that as soon as one of those words shows up in the other kind of email, it changes from a strong clue to neutral, *provided that* I actually train on the new email. I'm not training now unless there's a mistake/unsure, so the hapaxes remain strong clues (even when they point in the wrong direction).

BTW, when there are mistakes/unsures, I'm not training on all of them: as I did when I got up, I train the worst example then rescore, one at a time, until no mistakes/unsures remain. I'm never going to get sub-0.1% error rates this way, but if this is the best it ever got, I'd be quite happy with it for my personal email.
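[Editorial note: the worst-first loop Tim describes could look like this in outline. `score` and `train_ham` are hypothetical stand-ins for the real classifier interface, and the 0.20 cutoff is his stated ham_cutoff.]

```python
# Sketch of the "train the worst offender, then rescore everything" loop.
# score(msg) -> spam probability; train_ham(msg) adds msg to the ham set.
# Both are hypothetical stand-ins, not the actual SpamBayes API.

HAM_CUTOFF = 0.20

def train_worst_first(inbox_ham, score, train_ham):
    """Repeatedly train on the highest-scoring ham until every ham
    message in the inbox scores at or below the ham cutoff."""
    trained = []
    while True:
        scored = [(score(m), m) for m in inbox_ham]
        prob, worst = max(scored)
        if prob <= HAM_CUTOFF:
            break                    # nothing left above the cutoff
        train_ham(worst)             # train on the worst example only...
        trained.append(worst)        # ...then rescore everything and repeat
    return trained
```

The point of rescoring after every single training step is that one trained message can pull many similar messages below the cutoff, so far fewer messages end up in the training set than a train-on-everything pass would produce.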
Something to ponder? If so, you can get away with a very small database, and while hapaxes must not be removed blindly in this extreme scheme, using the atime field could (I suspect) be very effective in slashing the already-small database size (lots of hapaxes will never be seen again even if you train on everything; the WordInfo atime field tells you when a word was last used at all).
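[Editorial note: Tim's atime idea might be sketched like so. The `wordinfo` mapping and its field layout are assumptions for illustration, not the actual SpamBayes storage format, and the 30-day threshold is arbitrary.]

```python
# Sketch of atime-based hapax pruning. Each WordInfo record is modeled
# here as a (spamcount, hamcount, atime) tuple; field names mirror the
# discussion, not the real SpamBayes database layer.

import time

THIRTY_DAYS = 30 * 24 * 3600  # arbitrary staleness threshold, in seconds

def prune_stale_hapaxes(wordinfo, now=None):
    """wordinfo: dict mapping word -> (spamcount, hamcount, atime).
    Drop hapaxes (words seen in only one message) whose last-use time
    is older than 30 days; return the number removed."""
    now = time.time() if now is None else now
    stale = [w for w, (sc, hc, atime) in wordinfo.items()
             if sc + hc <= 1 and now - atime > THIRTY_DAYS]
    for w in stale:
        del wordinfo[w]
    return len(stale)
```

This keeps recent hapaxes (which, per the experiment above, carry most of the discriminating power early on) while shedding ones that have not been seen again.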
Tim Peters wrote:
I'm never going to get sub-0.1% error rates this way, but if this is the best it ever got, I'd be quite happy with it for my personal email. Something to ponder? If so, you can get away with a very small database, and while hapaxes must not be removed blindly in this extreme scheme, using the atime field could (I suspect) be very effective in slashing the already-small database size (lots of hapaxes will never be seen again even if you train on everything; the WordInfo atime field tells you when a word was last used at all).
Tim,

This seems to imply that you're still playing with the idea that hapaxes could be "slashed" from the database when using the "old" train-on-all procedure. I don't see how that can ever work, as all words pass through the hapax stage at some point. Or do you mean to slash "old" hapaxes only? And what is "old"?

Rob
-- 
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
[Tim]
I'm never going to get sub-0.1% error rates this way, but if this is the best it ever got, I'd be quite happy with it for my personal email. Something to ponder? If so, you can get away with a very small database, and while hapaxes must not be removed blindly in this extreme scheme, using the atime field could (I suspect) be very effective in slashing the already-small database size (lots of hapaxes will never be seen again even if you train on everything; the WordInfo atime field tells you when a word was last used at all).
BTW, I'm still doing this experiment, and my total training data is up to 45 ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP and FN are both rare now, and the Unsure rate is about 5% overall and visibly falling. The Unsure spam are more surprising than the Unsure ham, but that may be more psychological than real. For example, it took about 24 hours before I got my first Nigerian spam, and it was shocking to see it score at the low end of the Unsure range.

Looking at the internals is scary. I have entire folders that are called ham seemingly because the mailing list they come from has a few lexical conventions unique to it, and the hapaxes from the single training msg from that list save almost all of that list's msgs from Unsure status. In the msg of Rob's I'm replying to, these are all ham hapaxes:

    'database' 0.155172   'database,' 0.155172   'ever' 0.155172   'idea' 0.155172
    'quite' 0.155172   'scheme,' 0.155172   'seen' 0.155172   'subject:Outlook' 0.155172
    'subject:Spambayes' 0.155172   'subject:plugin' 0.155172   'subject:training' 0.155172
    'tells' 0.155172   'words' 0.155172

and they slug it out with these spam hapaxes:

    'away' 0.844828   'effective' 0.844828   'field' 0.844828   'mean' 0.844828   'word' 0.844828

That 'word' is a strong spam clue but 'words' a strong ham clue should tell us something about how robust this is <wink>.

[Rob Hooft]
This seems to imply that you're still playing with the idea that hapaxes could be "slashed" from the database when using the "old" train-on-all procedure. I don't see how that can ever work, as all words pass through the hapax stage at some point. Or do you mean to slash "old" hapaxes only?
Well, training has no effect on scoring until update_probabilities() is called, and in a batch-training context I mean hapax from update_probabilities's POV. Of course hamcounts or spamcounts for new words start life at 1, but when doing batch training I don't mean to look at the counts until the probabilities are updated. At that point, a hapax is a word that was seen in only one msg from the entire batch of new msgs. Here's a quick test, based on unpublished general python.org email (we can't publish the ham because it includes some personal email; GregW was working on making the spam collection available, but I haven't heard about that in a week; ditto his very large python.org virus collection). In each case, it trains on 2,741 ham and 948 spam, then predicts the same numbers of each. The "all" column includes hapaxes (wrt counts at the *end* of training). The gt1 column threw away words at the end of training where spamcount+hamcount <= 1; i.e., it retained only words that appeared more than once, the non-hapaxes. The gt2 column retained only words that appeared more than twice; and so on. ham_cutoff was 0.20 here, and spam_cutoff 0.90. 
    filename:        all       gt1       gt2       gt3       gt4       gt5       gt6
    ham:spam:   2741:948  2741:948  2741:948  2741:948  2741:948  2741:948  2741:948
    fp total:          1         0         1         0         0         0         0
    fp %:           0.04      0.00      0.04      0.00      0.00      0.00      0.00
    fn total:          2         2         2         1         2         3         4
    fn %:           0.21      0.21      0.21      0.11      0.21      0.32      0.42
    unsure t:         81        87        89        82        98        96       100
    unsure %:       2.20      2.36      2.41      2.22      2.66      2.60      2.71
    real cost:    $28.20    $19.40    $29.80    $17.40    $21.60    $22.20    $24.00
    best cost:    $22.20    $17.60    $20.00    $15.40    $16.80    $17.40    $22.40
    h mean:         0.81      0.86      0.87      0.72      0.67      0.64      0.65
    h sdev:         6.05      6.18      6.17      5.42      5.13      4.94      5.11
    s mean:        98.00     97.66     97.54     97.38     97.03     96.62     96.52
    s sdev:         9.26     10.22     10.37     10.62     11.19     12.49     12.61
    mean diff:     97.19     96.80     96.67     96.66     96.36     95.98     95.87
    k:              6.35      5.90      5.84      6.03      5.90      5.51      5.41
    # retained
    words:         74327     36437     23877     16143     12798     10719      9157

So while hapaxes are vital with very little training data, even with "just" about 4K training msgs they didn't buy anything in this test, and neither did words that appeared only two or three times, and it doesn't appear to be touchy (all of these columns show excellent results!).
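[Editorial note: the gtN pruning used for the table above - retain only words seen in more than N messages after batch training - is easy to express. The `wordinfo` mapping here is a stand-in for the real database.]

```python
# Sketch of the "gtN" pruning from the table above: after batch training
# and update_probabilities(), retain only words whose total message count
# exceeds n. The wordinfo mapping is a stand-in for the real database.

def prune_below(wordinfo, n):
    """Remove every word with spamcount + hamcount <= n.
    n=1 keeps only non-hapaxes (the 'gt1' column), and so on."""
    doomed = [w for w, (sc, hc) in wordinfo.items() if sc + hc <= n]
    for w in doomed:
        del wordinfo[w]
    return len(doomed)
```

Run once per threshold against a copy of the post-training database, this reproduces the "# retained words" comparison in the table.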
And what is "old"?
That remains a good question, and a good answer may differ between personal email and bulk email applications. A problem I see coming up in my personal email is that some correspondents only show up once a year, and the hapaxes they generate remain valuable clues, but only once a year. General python.org email doesn't appear to suffer anything like that (so long as personal email is kept out of the python.org mix).
Tim Peters wrote:
[Tim]
I'm never going to get sub-0.1% error rates this way, but if this is the best it ever got, I'd be quite happy with it for my personal email.
BTW, I'm still doing this experiment, and my total training data is up to 45 ham and 38 spam, out of a total of about 1,700 msgs processed so far. FP and FN are both rare now, and the Unsure rate is about 5% overall and visibly falling.
I just added a testdriver to CVS that simulates your behaviour as I understand it: it will train on the first 30 messages, plus on all misclassified and all unsure messages. It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am hierarchy. I think we should test its performance at different Options settings. It may not even be very realistic to train on fp's, as I think in my private E-mail I won't even check the spam folder very thoroughly at all. Anyway, a default run for me now gives:

 100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
 200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
 300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
 400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
 500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
 600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
 700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
 800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
 900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
fp: Data/Ham/Set2/n05250.txt score:0.9312
3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
fp: Data/Ham/Set1/n01590.txt score:0.9092
5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335

Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2
fn: 2
Total cost: $89.20

(This is on 3 out of my 10 test directories.)

Interesting to note so far:

* The "Total cost" is much higher than for train-on-all schemes, but it is only due to Unsures; fp and fn are still small.
* The database growth doesn't decay with time after a while; it can be described as:
      nwords = 9200 + 1.6 * nmessages
  or alternatively:
      nwords = 5700 + 40 * ntrained
  ...as can be seen in the attached png's.
* The training set is almost balanced, even though I scored many more ham than spam.
* The unsure rate drops over time:
      0-1000:    11.2% (minus 3.0% to be fair)
      1000-2000:  5.4%
      2000-3000:  4.3%
      3000-4000:  4.0%
      4000-5000:  3.9%
      5000-6000:  3.3%

Rob
-- 
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
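[Editorial note: Rob's procedure - unconditional training on the first messages, then training only on mistakes and unsures - could be sketched as follows. `score` and `train` are hypothetical stand-ins, not the actual weaktest.py interface.]

```python
# Sketch of the weaktest.py procedure: train on the first n_startup
# messages unconditionally, then only on misclassified or unsure ones.
# score(text) -> spam probability; train(text, is_spam) updates the model.

HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90  # assumed defaults from the thread

def weaktest(stream, score, train, n_startup=30):
    """stream: iterable of (text, is_spam) pairs in arrival order.
    Returns (fp, fn, unsure) counts over the post-startup messages."""
    fp = fn = unsure = 0
    for i, (text, is_spam) in enumerate(stream):
        if i < n_startup:
            train(text, is_spam)      # unconditional startup training
            continue
        prob = score(text)
        if HAM_CUTOFF < prob < SPAM_CUTOFF:
            unsure += 1
            train(text, is_spam)      # train on unsures
        elif is_spam and prob <= HAM_CUTOFF:
            fn += 1
            train(text, is_spam)      # train on false negatives
        elif not is_spam and prob >= SPAM_CUTOFF:
            fp += 1
            train(text, is_spam)      # train on false positives
        # correctly classified messages are scored only, never trained
    return fp, fn, unsure
```

Note that, as Rob observes, this driver only trains on a small fraction of the stream, which is why the training set stays roughly balanced even when the ham/spam mix is lopsided.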
[Rob Hooft]
I just added a testdriver to CVS that simulates your behaviour as I understand it: It will train on the first 30 messages,
I trained on 1 of each at the start. If I were to do it over, I'd start with an empty database <wink>.
plus on all misclassified and all unsure messages.
Since I'm doing this real-time on my live email, I've been training "on the worst" (farthest away from correct) msg that arrives in a batch, then rescoring all the ones that arrived in the batch, then training the worst remaining, ... until all new ham is below ham_cutoff and all new spam above spam_cutoff. I don't know that it matters, just being clear(er). As things turned out, this worst-at-a-time training never managed to push one of the remaining mistakes/unsures into the correct category, *except* for cases where I got more than one copy of a spam from different accounts at the same time. Then it always pushed the copies into scoring near 1.0, since the hapaxes in the training copy are abundant.
It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am hierarchy.
I think we should test its performance at different Options settings.
It may not even be very realistic to train on fp's, as I think in my private E-mail I won't even check the spam folder very thoroughly at all.
But I will (and do), and my primary interest here is to see how bad things can get if a user takes mistake-based training to an extreme. Despite that it's heavily hapax-driven, it appears to do very well when judged by error rate. I've been doing it long enough now, though, that it doesn't do so well subjectively: the Unsures are too often bizarre. For example, I sent a long reply here to Robert Woodland, and the copy I get back showed up as Unsure, with H=1 and S=0.66. There were a lot of accidental spam hapaxes in that msg! Training on it as ham then eliminated about 30 spam hapaxes (they're now neutral, having been seen in one ham and one spam each). So it's no different from my POV than the cases where people have sent me "surprising msgs" in the past, and my carefully trained slice-of-life classifier (regularly trained on a sampling of correctly classified msgs too) at the time had no trouble nailing them as ham or spam, with lots of non-hapax evidence to back it up. IOW, I'm still sticking to what I guessed before I started this: mistake-driven training will appear to work well over the short term, but it's brittle, and it's brittle because of its reliance on hapaxes.
Anyway, a default run for me now gives:
100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
[... full weaktest.py run output trimmed; it appears in full in Rob's message above ...]
6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2
fn: 2
Total cost: $89.20
(This is on 3 out of my 10 test directories).
Interesting to note so far:

* The "Total cost" is much higher than for train-on-all schemes, but the excess is due only to Unsures; fp and fn are still small.
That matches my experience too, although I started with 1 ham and 1 spam and had high FP and FN rates over the first few hours.
* The database growth doesn't decay with time after a while; it can be described as:

      nwords = 9200 + 1.6 * nmessages

  or alternatively:

      nwords = 5700 + 40 * ntrained

  ...as can be seen in the attached PNGs.
I expect that's mostly because there are still (relatively) few total msgs trained on.
* The training set is almost balanced, even though I scored many more ham than spam
Curiously, same here! I get about 500 ham and 100 spam per day, but my training database now has 47 ham and 41 spam. It does well, except when it sucks <wink>.
* The unsure rate drops over time:
I haven't measured that, but it's clearly been so here too (as I said before).
       0-1000: 11.2% (minus 3.0% to be fair)
    1000-2000:  5.4%
    2000-3000:  4.3%
    3000-4000:  4.0%
    4000-5000:  3.9%
    5000-6000:  3.3%
Proving what I've always suspected: over time, all msgs are repetitions of ones you've seen before <0.9 wink>.
Tim Peters wrote:
[Rob Hooft]
I just added a testdriver to CVS that simulates your behaviour as I understand it: it will train on the first 30 messages,
I trained on 1 of each at the start. If I were to do it over, I'd start with an empty database <wink>.
This is easy enough to change, but I left it at 30 for now.
Since I'm doing this real-time on my live email, I've been training "on the worst" (farthest away from correct) msg that arrives in a batch, then rescoring all the ones that arrived in the batch, then training the worst remaining, ... until all new ham is below ham_cutoff and all new spam above spam_cutoff. I don't know that it matters, just being clear(er). As things turned out, this worst-at-a-time training never managed to push one of the remaining mistakes/unsures into the correct category, *except* for cases where I got more than one copy of a spam from different accounts at the same time. Then it always pushed the copies into scoring near 1.0, since the hapaxes in the training copy are abundant.
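Tim's worst-at-a-time procedure could be sketched like this. This is an editor's illustration, not the actual spambayes code: `ToyClassifier`, `Msg`, and `badness` are stand-ins so the loop runs standalone; the real classifier's scorer and API are not shown.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, in the spirit of spambayes' ham_cutoff/spam_cutoff.
HAM_CUTOFF, SPAM_CUTOFF = 0.2, 0.9

@dataclass
class Msg:
    tokens: frozenset
    is_spam: bool

class ToyClassifier:
    """Stand-in classifier: scores by the fraction of known-spam tokens."""
    def __init__(self):
        self.spam_tokens = set()
        self.ham_tokens = set()

    def score(self, tokens):
        hits = sum(1 for t in tokens if t in self.spam_tokens)
        misses = sum(1 for t in tokens if t in self.ham_tokens)
        known = hits + misses
        return 0.5 if not known else hits / known

    def learn(self, tokens, is_spam):
        (self.spam_tokens if is_spam else self.ham_tokens).update(tokens)

def badness(clf, msg):
    """Distance of a message's score from its correct side of the cutoffs."""
    s = clf.score(msg.tokens)
    if msg.is_spam:
        return max(0.0, SPAM_CUTOFF - s)   # spam should score >= spam_cutoff
    return max(0.0, s - HAM_CUTOFF)        # ham should score <= ham_cutoff

def train_worst_first(clf, batch, max_rounds=100):
    """Train on the worst message, rescore the batch, repeat until every
    ham is below ham_cutoff and every spam above spam_cutoff (or we give
    up -- as Tim notes, some mistakes never get pushed over the line)."""
    trained = []
    for _ in range(max_rounds):
        worst = max(batch, key=lambda m: badness(clf, m))
        if badness(clf, worst) == 0:
            break                          # whole batch is on the right side
        clf.learn(worst.tokens, worst.is_spam)
        trained.append(worst)
    return trained
```

With a batch size of 1, as Rob notes below, this degenerates to plain train-on-mistakes.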
But I'm doing exactly the same, except that my batch size is always 1 ;-)
It may not even be very realistic to train on fp's, as I think that for my private e-mail I won't even check the spam folder very thoroughly at all.
But I will (and do), and my primary interest here is to see how bad things can get if a user takes mistake-based training to an extreme. Despite that it's heavily hapax-driven, it appears to do very well when judged by error rate.
Hm. There are so few fp's and fn's relative to unsures (at least after the 30-message initial training) that it wouldn't matter much (I think).
* The database growth doesn't decay with time after a while; it can be described as:

      nwords = 9200 + 1.6 * nmessages

  or alternatively:

      nwords = 5700 + 40 * ntrained

  ...as can be seen in the attached PNGs.
I expect that's mostly because there are still (relatively) few total msgs trained on.
Hm, it is more like a sqrt after more messages. See the attached image, which has a sqrt X axis. The fit matches the data even at the lowest end.

Regards,

Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
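Rob's sqrt law can be checked against the checkpoint numbers reported earlier in this thread. The sketch below (an editor's illustration; `fit_sqrt` and `r_squared` are hypothetical helpers, not spambayes code) does a closed-form least-squares fit of nwords = a + b*sqrt(ntrained), using (ham+spam trained, wrds) pairs read off Tim's incremental run.

```python
import math

# (ntrained = ham+spam trained, database words) checkpoints taken from
# the incremental run reported earlier in the thread.
DATA = [(47, 4203), (60, 6997), (83, 8887), (112, 10001), (145, 11734),
        (166, 12637), (189, 13407), (209, 14131), (251, 15951),
        (291, 17257), (324, 18764), (339, 19444)]

def fit_sqrt(points):
    """Least-squares fit of nwords = a + b * sqrt(ntrained)."""
    xs = [math.sqrt(n) for n, _ in points]
    ys = [w for _, w in points]
    k = len(points)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    a = (sy - b * sx) / k
    return a, b

def r_squared(points, a, b):
    """Coefficient of determination for the sqrt fit."""
    ys = [w for _, w in points]
    mean = sum(ys) / len(ys)
    ss_tot = sum((y - mean) ** 2 for y in ys)
    ss_res = sum((w - (a + b * math.sqrt(n))) ** 2 for n, w in points)
    return 1 - ss_res / ss_tot
```

On these twelve checkpoints the sqrt model fits tightly, consistent with Rob's "sqrt X axis" plot; vocabulary growth per trained message slows as the training set grows.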
[Tim]
... my primary interest here is to see how bad things can get if a user takes mistake-based training to an extreme. Despite that it's heavily hapax-driven, it appears to do very well when judged by error rate.
[Rob Hooft]
Hm. There are so few fp's and fn's relative to unsures (at least after the 30-message initial training) that it wouldn't matter much (I think).
As I tried to explain later, the psychological impact of the Unsures isn't attractive, though -- they remain bizarre to human eyes. When I got up today, I got 6 new Unsure spam: human growth hormone, gay porn, life insurance, mortgage rates, a msg that made no sense (empty except for a Yahoo auto-generated sig), and Genuine Leather Jackets.

It's not picking up on general "this is advertising" clues, or even on general "this is gay porn" clues. Indeed, "XXX" is still a hapax! This particular HGH spam will never get through again, because training it found 80(!) hapaxes unique to it. It's not going to do much to stop other HGH spam, though -- this one was especially chatty, and added words like 'forget', 'hair', 'lose', 'lost' and 'anywhere' to the collection of (what are now, after training on it) spam hapaxes -- just as previous HGH spam trained on didn't stop this one.

To my eyes, I had already told it about HGH spam, and I'm irked that it showed me another one. Ditto gay porn, ditto life insurance, etc.

[on database growth as a function of # of msgs]
Hm, it is more like a sqrt after more messages. See the attached image, which has a sqrt X axis. The fit matches the data even at the lowest end.
Cool! That was a dramatic graph indeed. Soon there will be no mysteries remaining <wink>.
[Rob Hooft]
... It may not even be very realistic to train on fp's, as I think that for my private e-mail I won't even check the spam folder very thoroughly at all.
FYI, here's my base weaktest run:

    Total messages 6800 (4000 ham and 2800 spam)
    Total unsure (including 30 startup messages): 124 (1.8%)
    Trained on 57 ham and 68 spam
    fp: 1
    fn: 0
    Total cost: $34.80
    Flex cost: $193.3770

Here's the same thing, but even weaker, fiddling the code *not* to train on false positives (so the only ham ever trained on is however much appeared in the first 30 startup msgs, and later Unsure ham):

    Total messages 6800 (4000 ham and 2800 spam)
    Total unsure (including 30 startup messages): 123 (1.8%)
    Trained on 57 ham and 66 spam
    fp: 1
    fn: 0
    Total cost: $34.60
    Flex cost: $199.3106

And one more time, not only not training on FP, but starting with an empty database (no startup msgs):

    Total messages 6800 (4000 ham and 2800 spam)
    Total unsure (NO startup messages): 123 (1.8%)
    Trained on 57 ham and 67 spam
    fp: 4
    fn: 1
    Total cost: $65.60
    Flex cost: $174.5831

All four FP were among the first 30. Since even my sisters <wink> could be talked into training on 10 msgs at the start:

    Total messages 6800 (4000 ham and 2800 spam)
    Total unsure (10 startup messages): 115 (1.7%)
    Trained on 50 ham and 66 spam
    fp: 0
    fn: 1
    Total cost: $24.00
    Flex cost: $124.9315

Now for another extreme: after 10 startup msgs, the system trains itself on its own decisions, except that:

1. Unsures are correctly classified by the user.
2. False negatives are correctly classified by the user.

But false positives are trained on *as spam*, assuming the user never looks at their spam folder. That takes a long time to run, because update_probabilities() is called after every msg. After 2,100 msgs,

    2100 trained:1181H+919S wrds:59659 fp:0 fn:0 unsure:26

and the unsures are growing very slowly now (at 1400 msgs there were 25 unsures).
So one more twist: as above (train on self-decisions, but spam below spam_cutoff is corrected by the user, and FP gets trained on as spam), but only update probabilities for each of the first 50 msgs, and every 50th msg thereafter: at 2,100 msgs, it was up to 29 unsure. At the end,

    Total messages 6800 (4000 ham and 2800 spam)
    Total unsure (10 startup messages): 48 (0.7%)
    Trained on 4000 ham and 2800 spam
    fp: 0
    fn: 0
    Total cost: $9.60
    Flex cost: $104.3355

It would have been more interesting had there been an FP, eh? One conclusion is that, so far as error rates go, on this data it doesn't much matter how training is done; but by any cost measure, lots of training is better than little (due to unsures).
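The mistake-based policy Tim is testing can be captured in a short driver loop. This is an editor's sketch of the *policy*, not the actual weaktest driver: `score` and `learn` are stand-ins for the real classifier, and the toy scorer at the bottom exists only so the loop runs standalone.

```python
# Hypothetical cutoffs, in the spirit of spambayes' ham_cutoff/spam_cutoff.
HAM_CUTOFF, SPAM_CUTOFF = 0.2, 0.9

def run_weaktest(messages, score, learn):
    """Mistake-based driver in the spirit of Tim's weaktest runs:
    train only on unsures and false negatives; a false positive is
    trained on *as spam* (the user never opens the spam folder)."""
    stats = {"fp": 0, "fn": 0, "unsure": 0, "trained": 0}
    for tokens, is_spam in messages:
        s = score(tokens)
        if HAM_CUTOFF < s < SPAM_CUTOFF:            # unsure: user corrects it
            stats["unsure"] += 1
            learn(tokens, is_spam)
            stats["trained"] += 1
        elif s >= SPAM_CUTOFF and not is_spam:      # fp: silently trained as spam
            stats["fp"] += 1
            learn(tokens, True)
            stats["trained"] += 1
        elif s <= HAM_CUTOFF and is_spam:           # fn: user corrects it
            stats["fn"] += 1
            learn(tokens, is_spam)
            stats["trained"] += 1
        # Confident, correct decisions are never trained on.
    return stats

# Toy scorer so the driver is runnable; the real spambayes scorer differs.
spam_words, ham_words = set(), set()

def score(tokens):
    hits = sum(t in spam_words for t in tokens)
    miss = sum(t in ham_words for t in tokens)
    return 0.5 if not hits + miss else hits / (hits + miss)

def learn(tokens, as_spam):
    (spam_words if as_spam else ham_words).update(tokens)
```

The point of the sketch is the branch structure: only unsures and mistakes ever reach `learn`, which is why Tim's runs end up trained on a few hundred messages out of thousands.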
Tim Peters wrote:
Now for another extreme: after 10 startup msgs, the system trains itself on its own decisions, except that:
1. Unsures are correctly classified by the user.
2. False negatives are correctly classified by the user.
But false positives are trained on *as spam*, assuming the user never looks at their spam folder. That takes a long time to run, because update_probabilities() is called after every msg. After 2,100 msgs,
2100 trained:1181H+919S wrds:59659 fp:0 fn:0 unsure:26
and the unsures are growing very slowly now (at 1400 msgs there were 25 unsures).
Now THIS is the way I'd like to go! I think this is approximately the minimum effort we can expect from lazy users (like myself). Sometimes a fp might actually be corrected by the user at some point, but testing it the way you did should give the minimal possible performance of a minimal-impact system that would not require much training to begin with.

There is one catch: what if the first 10 messages are all ham or all spam? Shouldn't we require at least a few of each?

How would this work to start on a mailing list? I guess we could deliver spambayes with 5 "representative recent spam" (or a URL where they can be found). The mailing list would moderate the first few messages to the list, and then the filter will kick in. If a message is "spam", it can be returned to the sender, saying that the message has been judged inappropriate by the filter based on wording. "Ham" can be posted without moderator approval. And all "unsure" messages are held for approval. The approval interface could have a separate "Spam" classification, but that is not really necessary: anything "inappropriate" can go in the spam corpus. For "fn"s, the archives should have the option to delete a message as spam.

For now my MUA is so badly integrated that I have yet to train a second time....

Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
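Rob's proposed mailing-list flow boils down to a three-way routing decision on the classifier's score. A minimal sketch (an editor's illustration; `route_post` and the cutoff values are assumptions, not spambayes code):

```python
# Hypothetical cutoffs, in the spirit of spambayes' ham_cutoff/spam_cutoff.
HAM_CUTOFF, SPAM_CUTOFF = 0.2, 0.9

def route_post(score):
    """Rob's proposed moderation flow: confident spam bounces back to
    the sender, confident ham posts straight through, and everything
    unsure is held for the moderator."""
    if score >= SPAM_CUTOFF:
        return "bounce"   # returned to sender as judged inappropriate
    if score <= HAM_CUTOFF:
        return "post"     # delivered without moderator approval
    return "hold"         # queued for manual approval
```

The moderator's decisions on "hold" messages would then feed back into the training corpora, which is the self-training loop Tim tested above.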
Hi, can someone define what a "hapax" is?
Scores remain grossly hapax-driven, but that's actually enough to classify most of my email correctly: a small number of subjects and senders and mailing lists overwhelmingly dominate my ham mix, and one email account accounts for the vast bulk of my spam. Removing the hapaxes from the database dropped the # of words from 5500 to about 1700. Rescoring the inbox with this reduced database then pushed about 5% of the msgs back into Unsure.
So (no surprise here) hapaxes are vital with little training data. That also means that as soon as one of those words shows up in the other kind of email, it changes from a strong clue to neutral, *provided that* I actually train on the new email. I'm not training now unless there's a mistake/unsure, so the hapaxes remain strong clues (even when they point in the wrong direction). BTW, when there are mistakes/unsures, I'm not training on all of them: as I did when I got up, I train the worst example, then rescore, one at a time, until no mistakes/unsures remain.
papaDoc

P.S. Someday I will contribute to the code, but first I need to learn Python.
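To answer papaDoc's question directly: a "hapax" (hapax legomenon) is a word that occurs exactly once in the training data -- which is why, with a tiny database, single words like Tim's "XXX" can dominate a score. A minimal sketch of finding them (editor's illustration; `hapaxes` is a hypothetical helper, not a spambayes function):

```python
from collections import Counter

def hapaxes(token_lists):
    """Tokens that occur exactly once across all trained messages --
    'hapax legomena' in the corpus-linguistics sense Tim is using."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {t for t, c in counts.items() if c == 1}

msgs = [["free", "pills", "xxx"], ["free", "meeting"], ["meeting", "agenda"]]
# "pills", "xxx" and "agenda" each appear once, so they are hapaxes here.
```

Pruning such words is what shrank Tim's database from ~5500 words to ~1700 -- and pushed about 5% of his inbox back into Unsure.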
On Friday 08 November 2002 7:20 am, Tim Peters wrote:
Provided the user has already done a decent amount of training, then as Paul Moore suggested it could even work to trust ham-vs-spam decisions immediately, and let user corrections undo those as needed. A well-trained system should be pretty robust against a few misclassifications over the short term.
For the last two weeks I have been using a setup that uses this type of unsupervised training. I have a procmail filter that sends a copy of all incoming ham and spam to two separate mailboxes. These mailboxes are used for overnight batch training, then deleted. Messages marked 'Unsure' do not take part in this automatic training. I perform separate filtering for spam and 'unsure' in my MUA. So far I am manually inspecting the unsure folder and manually adding those messages to the appropriate training mailboxes. Initially about 3% of mails were 'unsure', but this has dropped to less than 1% after 2 weeks. Starting next week I plan to change the MUA filtering to treat 'unsure' the same as 'ham', and stop all manual training. It will be interesting to see if the training remains stable.
Some confirmation that the classifier does not automatically train on messages that it's sure about...

SpamBayes Manager, Training database status, stated "Database only has... 7 good and 388 spam - you should consider performing additional training." Then I received 57 e-mails classified as: 2 ham, 54 uncertain, and 29 spam. SpamBayes Manager, Training database status still stated... "7 good and 388 spam" instead of what I had hoped would be... "9 good and 417 spam". After classifying the 54 uncertain as spam, the numbers expectedly went to... "7 good and 442 spam". With the proposed idea the potentially advantageous numbers could have been... "9 good and 471 spam".

"Piers Haken" <piersh@friskit.com> wrote in message news:9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com...

I don't believe you need this. I think that the classifier automatically trains on messages as they arrive (or at least on messages that it's sure about). ...
-----Original Message----- From: Moore, Paul [mailto:Paul.Moore@atosorigin.com] ... One thing I don't see, however, is a means of confirming the classifier's decisions as correct. ...
[Dennis W. Bulgrien]
Some confirmation that classifier does not automatically train on messages that it's sure about... SpamBayes Manager, Training database status, stated
Dennis,

Your observations are correct. It appears that the configuration switch for "train on everything" is exposed in the POP3PROXY version (used by all mailers besides Outlook), but not in the Outlook Plug-In. I would also like to experiment with it, but like you, I use the plug-in.

From your numbers, you might consider training some additional ham into your database. Other people have found poor results when the numbers of spam and ham messages trained are drastically different.
-- Seth Goodman Humans: personal replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above
Thanks. I now understand the difference between POP3PROXY and Outlook. I too would REALLY like the configuration switch for additional training to be put into the Outlook plug-in. I'll wait patiently in great anticipation...

"Seth Goodman" <nobody@spamcop.net> wrote in message news:MHEGIFHMACFNNIMMBACAEEMMGAAA.nobody@spamcop.net...

... Your observations are correct. It appears that the configuration switch for "train on everything" is exposed for use in the POP3PROXY version (all other mailers besides Outlook), but not in the Outlook Plug-In. ...
participants (10)
-
Dennis W. Bulgrien -
Mark Hammond -
papaDoc -
Paul Moore -
Piers Haken -
Rob Hooft -
Rob W.W. Hooft -
Seth Goodman -
Tim Peters -
Toby Dickenson