Outlook plugin - training
When the Outlook plugin filters mail, it classifies each message as either spam or potential spam, and can put it in the appropriate folder. In the spam/potential spam folders there is a "Recover from Spam" button, and in other folders there is a "Delete as spam" button. These buttons add the message to the training database as well as taking the appropriate action.

One thing I don't see, however, is a means of confirming the classifier's decisions as correct: a "yes, that is spam" button for the spam folder, and a "yes, that's ham" button in non-spam folders. As I'm starting from a very small message base, I worry that correct classifications are still somewhat based on "luck", and training on correct decisions would help to increase both my confidence level and the classifier's.

I can do this by regular retraining, but that has two disadvantages: it's much clumsier than simply clicking on a "clever boy!" button, and it relies on me not deleting messages until I do a training run. Much of the ham I get is "read and forget", so I'd rather delete immediately.

When I get a chance to dive into the code, I'll see how hard this would be to implement.

Paul.
[Moore, Paul]
... I can do this by regular retraining, but that has 2 disadvantages: it's much clumsier than simply clicking on a "clever boy!" button, and it relies on me not deleting messages until I do a training run. Much of the ham I get is "read and forget", so I'd rather delete immediately.
When I get a chance to dive into the code, I'll see how hard this would be to implement.
Automatic training needs lots of work. The Outlook client has gotten smarter than anything else about this so far, but at the moment it's basically automating "mistake based" training, which I think will prove to be a Bad Idea over time. Ideal is to train regularly on a random sample of all msgs, whether or not correctly classified (I fake this by hand for now). That presents some UI and algorithmic challenges.

It will also create a database size problem: without a strategy for pruning useless words, the database will grow without bounds (an intuition that at a certain non-fantastic size, "all words" will have been seen is incorrect for computer-based indexing apps, and especially for email -- unique words keep appearing and keep bloating the beast). There's been no research done here yet on how to prune a database over time without damaging accuracy.
Tim Peters <tim.one@comcast.net> wrote:
It will also create a database size problem: without a strategy for pruning useless words, the database will grow without bounds (an intuition that at a certain non-fantastic size, "all words" will have been seen is incorrect for computer-based indexing apps, and especially for email -- unique words keep appearing and keep bloating the beast).
Did you actually find this? I found the growth tailed off dramatically after not too long. I no longer have the exact numbers, but database growth for me tailed off almost to nothing after I had trained on something like 1500 messages.

Charles
--
Charles Cazabon <python-spambayes@discworld.dyndns.org>
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
[Tim]
It will also create a database size problem: without a strategy for pruning useless words, the database will grow without bounds
[Charles Cazabon]
Did you actually find this?
Yes.
I found the growth tailed off dramatically after not too long.
That too -- the second derivative is negative from the start, but the first remains positive. "It's like" log that way, growing ever more slowly, but inexorably.
I no longer have the exact numbers, but database growth for me tailed off almost to nothing after I had trained on something like 1500 messages.
When I run my c.l.py test, 10 classifiers are built, each training on about 30,000 msgs. The classifier pickles hit 18MB each then. My classifier at work has been trained on about 1,100 msgs, and its classifier pickle is about 2MB. My classifier at home has been trained on about 3,000 msgs, and its classifier pickle is about 4MB. That last one is from memory, so when I get home I'll make up a different number so that the three points exactly fit a log curve <wink>.

Nobody has used this system long enough under a high enough daily load yet to get frantic about database bloat, but the people who have run very large tests must all be aware that it's inevitable (without pruning). I've already noticed the increase in startup time on my home box, due to loading a bigger pickle every day.
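No pruning strategy exists yet, as noted above. As a purely hypothetical starting point (the function and the token -> (spamcount, hamcount) mapping here are invented for illustration, not the actual spambayes data structures), one could drop the rarest tokens, which typically dominate a bloated database while carrying the least evidence; whether this damages accuracy is exactly the open research question:

```python
def prune_rare_tokens(wordinfo, min_count=2):
    """Drop tokens whose total training count is below min_count.

    wordinfo maps token -> (spam_count, ham_count).  Returns a new
    dict containing only the tokens seen at least min_count times.
    """
    return {
        token: (spam, ham)
        for token, (spam, ham) in wordinfo.items()
        if spam + ham >= min_count
    }

# Tokens seen many times survive; a one-off nonsense token is dropped.
db = {"viagra": (40, 1), "meeting": (0, 25), "xyzzy1234": (1, 0)}
pruned = prune_rare_tokens(db)
```

The open question is the threshold: too aggressive and the classifier loses rare but highly discriminating clues; too lax and the database keeps growing log-like but inexorably.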
Tim Peters wrote:

Automatic training needs lots of work. The Outlook client has gotten smarter than anything else about this so far, but at the moment it's basically automating "mistake based" training, which I think will prove to be a Bad Idea over time.
Ideal is to train regularly on a random sample of all msgs, whether or not correctly classified (I fake this by hand for now). That presents some UI and algorithmic challenges.
Note that "random sample" is not as trivial as all that, either - if you have a very high ham:spam ratio in your training DB, your accuracy will suffer (see the tests from Alex, myself and others). An easy example of this is those of us who are on a bunch of higher volume python.org lists - Greg's sterling work there means that very little spam gets through there. As spambayes takes over the world, this could be a larger problem.

Anthony
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.
[Anthony Baxter]
Note that "random sample" is not as trivial as all that, either - if you have a very high ham:spam ratio in your training DB, your accuracy will suffer (see the tests from Alex, myself and others).
I still need to try to make sense of those tests. A real complication is that more than one thing changes when trying to test ratios: it's not just the ratio that changes, it's the absolute number of each trained on too. For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham and 10000 spam. The ratios are identical. Do we expect the error rates to be identical too? I don't, but haven't tried it. I expect the latter would do better than the former, despite the identical ratios, simply because more msgs allow better spamprob estimates.

Something missing in "the ratio tests" is a rationale (even an after-the-fact one) for believing there's some aspect of the system that's sensitive to the ratio. The combining method certainly is not, and the spamprob estimation (update_probabilities()) deliberately works with percentages instead of raw counts so that the ham::spam training ratio has no direct effect on the spamprobs calculated.
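The point about percentages can be illustrated with a minimal sketch (the function and parameter names are illustrative, not the actual update_probabilities() signature, and this omits the smoothing the real estimator applies to low-count words):

```python
def spamprob(spam_count, ham_count, nspam, nham):
    """Per-word spam probability computed from *rates*, not raw counts.

    Dividing each count by the total number of spam/ham messages
    trained on means that scaling both corpora by the same factor
    leaves the estimate unchanged -- the ham:spam ratio per se has
    no direct effect, only the per-corpus frequencies do.
    """
    spam_rate = spam_count / nspam  # fraction of trained spam containing the word
    ham_rate = ham_count / nham     # fraction of trained ham containing the word
    return spam_rate / (spam_rate + ham_rate)

# Same ratio and same frequencies, 10x the absolute counts:
p_small = spamprob(spam_count=10, ham_count=5, nspam=1000, nham=5000)
p_large = spamprob(spam_count=100, ham_count=50, nspam=10000, nham=50000)
# The two probabilities are identical -- but the real-world benefit of
# the larger corpus is that each word's rate estimate has less noise.
```

This also shows why the absolute numbers still matter in practice: with only 1,000 spam trained, a word's spam rate can only take values in steps of 1/1000, which is the granularity limit Tim alludes to below.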
An easy example of this is those of us who are on a bunch of higher volume python.org lists - Greg's sterling work there means that very little spam gets through there.
The total # of spam training msgs does limit how high a spamprob can get, and the total # of ham training msgs limits how low. The *suspicion* I had running my large c.l.py test is that it wasn't the ratio that mattered so much as the absolute number, and that the error rates didn't "settle down" to the 4th digit until I got near 10,000 spam total.
As spambayes takes over the world, this could be a larger problem.
Despite all the above <wink>, when faking "random sample" by hand in my personal classifiers, I see I've *ended up* aiming for about an equal number of each in my training data. That works well too (for me, and anecdotally -- these aren't controlled experiments).
This is a wonderful idea! It would be terrific if a check box were added to the SpamBayes Manager, Advanced tab, which would cause automatic training on all ham left in the Inbox and all spam put in the spam folder.

I emphasize something Mark Hammond said about "underestimating our own tool". If an e-mail scores just over a 90% spam threshold, say at 91%, the 9% that isn't classified as spam may contain clues as to how the SPAMming community is trying to outwit the population. For instance, the words "buy v iagra online" (the space is to help THIS email not to be classified as spam :-) are in the database. The SPAMmer sends "buy v.iagra online" (have you seen that technique?), which, let's assume, the classifier scores at 91% due to "buy" and "online". By automatically training on this correctly categorized email, the word "v.iagra" gets added to the database. Then if the SPAMmer tries to send "b.uy v.iagra o.nline", the classifier will recognize "v.iagra" from last time and have an advantage. It morphs with changes in the black industry, sort of like artificial intelligence.

"Delete As Spam" and "Recover from Spam" buttons correct any error; for those who don't care to check the spam folder, the feature can remain disabled.

"Moore, Paul" <Paul.Moore@atosorigin.com> wrote in message news:16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com...

... As I'm starting from a very small message base, I worry that correct classifications are still somewhat based on "luck", and training based on correct decisions would help to increase both my and the classifier's confidence level. ...
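The mechanism Dennis describes can be sketched in a toy form (whitespace tokenization and a bare counter are deliberate simplifications of the real spambayes tokenizer and database; the names here are invented):

```python
from collections import Counter

# Toy spam-token database: token -> number of spam msgs it appeared in.
spam_tokens = Counter()

def train_spam(message):
    """Add every token of a message classified as spam to the database.

    Splitting on whitespace means "v.iagra" survives as a single
    token, so the obfuscated spelling itself becomes a spam clue.
    """
    spam_tokens.update(message.lower().split())

# The 91%-scored message is automatically trained on as spam;
# the previously unseen token "v.iagra" enters the database
# alongside the old clues "buy" and "online".
train_spam("buy v.iagra online")
```

After this, a later "b.uy v.iagra o.nline" carries a known spam token even though its other words are fresh obfuscations, which is the adaptive effect the paragraph above is after.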
[Dennis W. Bulgrien]
"Moore, Paul" <Paul.Moore@atosorigin.com> wrote in message
news:16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com...

What newsgroup was this in? I can't retrieve the message from the link (perhaps truncated?).

I'm extremely intrigued by the possibilities of continuous training. Maybe it works better, maybe it doesn't. Does anyone have any experiences in this regard?

In any case, with continuous training comes a continuously growing database. Mistakes in classification will stay in the database forever, as will forms of spam that are no longer common. I don't *know* that this is a serious problem, but intuition says it won't help anything. A smaller database should also learn faster than a larger one.

I have put some ideas up on the SpamBayes Wiki at http://entrian.com/sbwiki/TrainingIdeas concerning automatic pruning of database entries for use with continuous training. I encourage anyone who shares this interest, and in particular any of the developers, to add comments to the Wiki, comment on the mailing list, or comment to me off-line. I am willing to put work into this, write code, and experiment, but I have no desire to waste time hashing out ideas that have already been explored before. Thanks in advance.

--
Seth Goodman
Humans: personal replies to sethg [at] GoodmanAssociates [dot] com
Spambots: disregard the above
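One idea in that direction is age-based expiry: entries not seen for a long time fade away, so obsolete spam patterns and old training mistakes eventually leave the database. A minimal sketch, assuming a hypothetical `last_seen` timestamp is recorded per token (the actual spambayes schema has no such field, and all names here are invented):

```python
import time

def expire_stale_tokens(wordinfo, max_age_days=120, now=None):
    """Drop tokens not seen in training or scoring for max_age_days.

    wordinfo maps token -> dict with 'spam' and 'ham' counts plus a
    'last_seen' Unix timestamp, updated whenever the token occurs.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return {t: info for t, info in wordinfo.items()
            if info["last_seen"] >= cutoff}

# A 200-day-stale entry is expired; a recently seen one survives.
now = 1_000_000_000
db = {
    "nigeria": {"spam": 30, "ham": 0, "last_seen": now - 200 * 86400},
    "invoice": {"spam": 5, "ham": 12, "last_seen": now - 3 * 86400},
}
kept = expire_stale_tokens(db, max_age_days=120, now=now)
```

The trade-off mirrors the earlier pruning discussion: expiry bounds the database and lets it track the current spam landscape, at the risk of forgetting rare but still-valid clues.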
No newsgroup, sorry. The "newsgroup" reference is to this very mailing list, which is mirrored at news://news.gmane.org/gmane.mail.spam.spambayes.general. There you can see the full e-mail discussion history.

"Seth Goodman" <nobody@spamcop.net> wrote in message...
"Moore, Paul" <Paul.Moore@atosorigin.com> wrote in message news:16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com..
participants (7)

- Anthony Baxter
- Charles Cazabon
- Dennis W. Bulgrien
- Moore, Paul
- Seth Goodman
- Tim Peters
- Tim Peters