Training corrupts mbox files
I've been using spambayes for a couple of months now, and its results are spectacular. On my email setup, it easily catches 300 spams out of a total of 400 messages each day, with virtually no false positives or negatives. I love it!

My only problem with it is that it seems to trash my mbox files when I train it. I use the following command to train it on ham and spam mboxes:

mboxtrain.py -d $HOME/.hammiedb -g $HOME/mail_processing/caught/bayes_good -s $HOME/mail_processing/caught/spam

It correctly learns the messages, but afterwards the two mbox files have a bunch of erroneous "messages" at the end, and opening either mbox in mutt gives a series of errors concerning invalid uid sequences.

Has anyone else had problems training spambayes on mbox files? Is there anything else I should be doing to prevent spambayes from rewriting the mbox file? If it helps, I can post a sample of the before and after mbox files to a webpage for perusal.

Thanks,
--
David McLaughlin
david@dsmcl.net
David McLaughlin <david@dsmcl.net> writes:
Has anyone else had problems training spambayes on mbox files? Is there anything else I should be doing to prevent spambayes from rewriting the mbox file? If it helps, I can post a sample of the before and after mbox files to a webpage for perusal.
I wrote that code, and I have to confess I only tested it on a couple of mboxes. Go ahead and post your samples and I'll see if I can fix it. Neale
Thanks for taking a look at it! I have put a sample before and after mbox at the following location: ftp://ftp.dsmcl.net/spambayes_samplembox.tgz It looks like it may be duplicating some lines in the header, and adding an extra line break, which generates "extra" bogus mail messages. As an example, here is the original subset of headers:
Begin
From ebyfi587wxi@seductive.com Mon Apr 28 15:36:22 2003
Return-Path: <ebyfi587wxi@seductive.com>
Delivered-To: dogwood-dogwoodproductions:com-rayn@dogwoodproductions.com
From ebyfi587wxi@seductive.com Mon Apr 28 15:36:22 2003
Return-Path: <ebyfi587wxi@seductive.com>
Delivered-To: dogwood-dogwoodproductions:com-rayn@dogwoodproductions.com
X-Envelope-To: rayn@dogwoodproductions.com
<<End
and here is the subset after training:
Begin
From ebyfi587wxi@seductive.com Mon Apr 28 15:36:22 2003
Return-Path: <ebyfi587wxi@seductive.com>
Delivered-To: dogwood-dogwoodproductions:com-rayn@dogwoodproductions.com
X-Spambayes-Trained: spam

From ebyfi587wxi@seductive.com Mon Apr 28 15:36:22 2003
Return-Path: <ebyfi587wxi@seductive.com>
Delivered-To: dogwood-dogwoodproductions:com-rayn@dogwoodproductions.com
X-Envelope-To: rayn@dogwoodproductions.com
<<End
It seems to be trying to add the "X-Spambayes-Trained" line, adding an extra return, and then duplicating the beginning couple of headers again. Not sure. Thanks! -- David McLaughlin david@dsmcl.net
From: Neale Pickett <neale@woozle.org>
Date: Sat, Apr 26, 2003 at 12:15:39PM -0700
To: david@dsmcl.net
Subject: Re: [Spambayes] Training corrupts mbox files
David McLaughlin <david@dsmcl.net> writes:
Has anyone else had problems training spambayes on mbox files? Is there anything else I should be doing to prevent spambayes from rewriting the mbox file? If it helps, I can post a sample of the before and after mbox files to a webpage for perusal.
I wrote that code, and I have to confess I only tested it on a couple of mboxes. Go ahead and post your samples and I'll see if I can fix it.
Neale
David McLaughlin <david@dsmcl.net> writes:
Thanks for taking a look at it!
I have put a sample before and after mbox at the following location:
ftp://ftp.dsmcl.net/spambayes_samplembox.tgz
It looks like it may be duplicating some lines in the header, and adding an extra line break, which generates "extra" bogus mail messages.
Yeah, sure enough. You're using mutt? The mbox "standard" is that any line beginning with "From " denotes a new message. So a diff of those two mailboxes shows things like this:

 From removed@example.com Mon Apr 28 16:37:18 2003
 Return-Path: <removed>
 Delivered-To: <removed>
+X-Spambayes-Trained: spam
+
 From removed@example.com Mon Apr 28 16:37:18 2003
 Return-Path: <removed>
 Delivered-To: <removed>

I think spambayes is actually doing the right thing here--it's taking a weird mbox and un-weirding it. I think Tim Stone might be working on a generic message store thingy: Tim, would that eliminate the need to rewrite mailboxes altogether?

But David, if I were you I'd start trying to hunt down what's creating those duplicate headers. It might be some sort of wonky procmail recipe that just writes out headers and then drops through, but that's just a shot in the dark guess. Heh, maybe it's hammiefilter <0.7 wink>

Neale
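The "From " rule Neale describes can be illustrated with a minimal Python sketch. It is not the spambayes code: `split_mbox` is a hypothetical helper that starts a new message at every line beginning with "From ", and the trailing `X-Envelope-To` value and body line are placeholders added for the example. Some mbox readers are stricter about what counts as a separator, so treat this as the simplest reading of the format.

```python
def split_mbox(text):
    """Split mbox text into messages (lists of lines).

    A new message starts at every line that begins with "From ",
    which is the simple rule described in the thread.
    """
    messages = []
    for line in text.splitlines():
        if line.startswith("From "):
            messages.append([line])
        elif messages:
            messages[-1].append(line)
    return messages


# Header block shaped like the pre-training sample, addresses redacted.
sample = "\n".join([
    "From removed@example.com Mon Apr 28 16:37:18 2003",
    "Return-Path: <removed>",
    "Delivered-To: <removed>",
    "From removed@example.com Mon Apr 28 16:37:18 2003",
    "Return-Path: <removed>",
    "Delivered-To: <removed>",
    "X-Envelope-To: <removed>",
    "",
    "body text",
])

messages = split_mbox(sample)
print(len(messages))  # 2: the duplicated "From " line opens a second message
```

Under this rule the first three lines form a complete (body-less) message of their own, which is consistent with the training header and a blank separator being written after `Delivered-To:` in the rewritten file.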
4/29/2003 8:59:56 PM, Neale Pickett <neale@woozle.org> wrote:
I think spambayes is actually doing the right thing here--it's taking a weird mbox and un-weirding it. I think Tim Stone might be working on a generic message store thingy: Tim, would that eliminate the need to rewrite mailboxes altogether?
I haven't started looking at the mbox problems yet, but in general it would only eliminate mbox rewriting if you don't want *any* of the spambayes headers added to the messages in the mbox. It can remember, given a message id, how that message was classified, and how it is trained, but that's all at the moment. That would seem to be inadequate for this problem... c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't.
Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:
I haven't started looking at the mbox problems yet, but in general it would only eliminate mbox rewriting if you don't want *any* of the spambayes headers added to the messages in the mbox.
In the case of training on an entire mailbox, that would probably be okay. The mbox format is kind of wonky, so if we can avoid touching it, tant mieux. Neale
Thanks for the suggestion! I did some digging at various stages in my filtering process, to see exactly where those headers were added. It seems an older version of Mail::Audit was buggy -- upgrading to the latest version fixed the problem. Back to training, -- David McLaughlin david@dsmcl.net
But David, if I were you I'd start trying to hunt down what's creating those duplicate headers. It might be some sort of wonky procmail recipe that just writes out headers and then drops through, but that's just a shot in the dark guess. Heh, maybe it's hammiefilter <0.7 wink>
David> It seems an older version of Mail::Audit was buggy -- upgrading
David> to the latest version fixed the problem.

That's all well and good, but I was thinking perhaps mboxtrain should maintain a little database parallel to its mbox file whose entries are keyed by message-id. It could store its results there and never have to monkey with the mbox file. "-f"orce training would simply be a matter of deleting all keys in that database at the start of the run.

(My apologies if this was suggested previously. I hadn't really been paying much attention to this thread, then had occasion to try out mboxtrain for the first time last night. It got me thinking about the problem.)

Skip
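Skip's idea can be sketched in a few lines of Python. This is an illustration under assumptions, not how mboxtrain works: `message_key`, `train_unseen`, and the `train` callback are hypothetical names, `seen` stands in for the parallel database (any dict-like object), and falling back to a content hash when a message has no Message-ID is a choice made for the example.

```python
import hashlib
from email import message_from_string


def message_key(msg):
    """Return the Message-ID, or a content hash if the header is missing."""
    msgid = msg.get("Message-ID")
    if msgid:
        return msgid.strip()
    return hashlib.sha1(msg.as_bytes()).hexdigest()


def train_unseen(messages, seen, label, train, force=False):
    """Call train(msg, label) for each message not yet recorded in `seen`.

    The messages themselves are only read, never modified.
    Returns the number of messages trained on this run.
    """
    if force:
        seen.clear()  # "-f"orce: forget everything so all messages retrain
    trained = 0
    for msg in messages:
        key = message_key(msg)
        if seen.get(key) == label:
            continue  # already trained with this label
        train(msg, label)
        seen[key] = label
        trained += 1
    return trained


# Tiny demonstration with two in-memory messages and a fake trainer.
msgs = [
    message_from_string("Message-ID: <1@example.com>\n\nhello\n"),
    message_from_string("Message-ID: <2@example.com>\n\nworld\n"),
]
seen = {}
log = []


def record(msg, label):
    log.append((msg["Message-ID"], label))


first = train_unseen(msgs, seen, "spam", record)
second = train_unseen(msgs, seen, "spam", record)
forced = train_unseen(msgs, seen, "spam", record, force=True)
print(first, second, forced)  # 2 0 2
```

For persistence between runs, a `shelve.open(...)` object could be passed as `seen`, since it offers the same mapping interface. A message that moves from one label to the other is simply trained again here; a real implementation would probably want to untrain it first.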
4/30/2003 5:29:22 PM, Skip Montanaro <skip@pobox.com> wrote:
David> It seems an older version of Mail::Audit was buggy -- upgrading David> to the latest version fixed the problem.
That's all well and good, but I was thinking perhaps mboxtrain should maintain a little database parallel to its mbox file whose entries are keyed by message-id. It could store its results there and never have to monkey with the mbox file. "-f"orce training would simply be a matter of deleting all keys in that database at the start of the run.
Yeah, we're already walking down that path with the messageinfodb that's maintained in message.py. This will certainly need some more work for mbox purposes, but it would be perfect if mboxes never needed to be rewritten. That's the goal, afaic. c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't.
Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:
4/30/2003 5:29:22 PM, Skip Montanaro <skip@pobox.com> wrote:
That's all well and good, but I was thinking perhaps mboxtrain should maintain a little database parallel to its mbox file whose entries are keyed by message-id. It could store its results there and never have to monkey with the mbox file. "-f"orce training would simply be a matter of deleting all keys in that database at the start of the run.
Yeah, we're already walking down that path with the messageinfodb that's maintained in message.py. This will certainly need some more work for mbox purposes, but it would be perfect if mboxes never needed to be rewritten. That's the goal, afaic.
Ditto to what Tim wrote. If mboxtrain keeps a list somewhere of the messages it has seen, there's no longer any need to modify the mbox.

Neale
On Thursday 01 May 2003 5:22 pm, Neale Pickett wrote:
Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:
4/30/2003 5:29:22 PM, Skip Montanaro <skip@pobox.com> wrote:
That's all well and good, but I was thinking perhaps mboxtrain should maintain a little database parallel to its mbox file whose entries are keyed by message-id. It could store its results there and never have to monkey with the mbox file. "-f"orce training would simply be a matter of deleting all keys in that database at the start of the run.
Yeah, we're already walking down that path with the messageinfodb that's maintained in message.py. This will certainly need some more work for mbox purposes, but it would be perfect if mboxes never needed to be rewritten. That's the goal, afaic.
Ditto to what Tim wrote. If mboxtrain keeps a list somewhere of the messages it has seen, there's no longer any need to modify the mbox.
fwiw, I stopped using mboxtrain and its incremental mode in favor of hammie, and always doing a full train on whole mailboxes. It's not significantly slower.
Toby> fwiw, I stopped using mboxtrain and its incremental mode in favor
Toby> of hammie, and always doing a full train on whole mailboxes. It's
Toby> not significantly slower.

How big are your mailboxes? I have about 12,000 hams and 7,000 spams in my training sets, so I generally avoid full retrains.

I'm considering a somewhat different procmail-based setup for some other people, however, in which they would have three email addresses: foo@somewhere, foo+spam@somewhere and foo+ham@somewhere. The last two would (obviously) be for training. My thought was to simply have the training aliases append to mbox files and run mboxtrain from cron periodically. I'd logrotate the training files to keep the number of files and their sizes to a minimum.

Someone else must already be doing something like this. Care to share?

Skip
On Thursday 01 May 2003 6:21 pm, Skip Montanaro wrote:
Toby> fwiw, I stopped using mboxtrain and its incremental mode in favor
Toby> of hammie, and always doing a full train on whole mailboxes. It's
Toby> not significantly slower.
How big are your mailboxes? I have about 12,000 hams and 7,000 spams in my training sets, so I generally avoid full retrains.
I'm considering a somewhat different procmail-based setup for some other people, however, in which they would have three email addresses, foo@somewhere, foo+spam@somewhere and foo+ham@somewhere. The last two would (obviously) be for training. My thought was to simply have the training aliases append to mbox files and run mboxtrain from cron periodically. I'd logrotate the training files to keep the number of files and their sizes to a minimum.
Someone else must already be doing something like this. Care to share?
I am using kmail with approximately 40 folders (mailboxes). I am training directly from the kmail folders. That means I don't need duplicate copies of emails in a separate training database, I can use the normal kmail gui for adjusting the training sets, and training doesn't use ancient emails. I use kmail to delete personal emails after 6 months, mailing lists after a few weeks, and spams after a year. That keeps the total content stable at about 6000 hams and 800 spams.

I train overnight from cron, and it takes about 5 minutes. From memory, incremental mboxtrain was taking about 4 minutes with a lower cpu usage.

I have a script that generates a long hammie.py command line by parsing the kmail configuration file. It assumes that:
- the folder called "spam" and all its subfolders are spam training material
- "trash" and "drafts" should be ignored
- every other folder contains ham training material.

I use procmail to run the hammie filter to add the headers during mail delivery. kmail filters are used to sort incoming mail: spam into a separate folder. (For a while my wife was using the same setup, but running the hammie filter from kmail. No procmail needed.)

I use two folders for spam: spam/archive and spam/new. kmail filters the spam into spam/new and marks it read. Every week I review spam/new for false positives (I'm still waiting for my first!), then empty it into spam/archive.

Any interest in better documentation of this setup?

--
Toby Dickenson
http://www.geminidataloggers.com/people/tdickenson
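The three folder rules Toby lists are easy to sketch in Python. This is not his script: parsing the kmail configuration file is left out, the folder names are made up, `classify_folder` and `build_train_args` are hypothetical helpers, and the `-g`/`-s` flags are borrowed from the mboxtrain.py command at the top of the thread on the assumption that hammie.py accepts them the same way.

```python
def classify_folder(name):
    """Return "spam", "ham", or None (ignored) for a folder path.

    Rules from the message above: "spam" and its subfolders are spam,
    "trash" and "drafts" are ignored, everything else is ham.
    """
    top = name.lower().split("/")[0]
    if top in ("trash", "drafts"):
        return None
    if top == "spam":
        return "spam"
    return "ham"


def build_train_args(folders):
    """Build the training arguments for a list of folder paths."""
    args = []
    for name in sorted(folders):
        kind = classify_folder(name)
        if kind == "spam":
            args += ["-s", name]  # assumed flag for spam training material
        elif kind == "ham":
            args += ["-g", name]  # assumed flag for ham training material
    return args


folders = ["inbox", "spam/new", "spam/archive", "trash", "drafts", "lists/python"]
print(build_train_args(folders))
# ['-g', 'inbox', '-g', 'lists/python', '-s', 'spam/archive', '-s', 'spam/new']
```

The resulting list could be appended to the command run from cron; real folder paths would come from wherever kmail stores its mailboxes.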
Skip Montanaro <skip@pobox.com> writes:
Someone else must already be doing something like this. Care to share?
I'm doing something similar to Toby. I have a few guinea pig users doing this currently, and eventually everyone will be doing this. Each IMAP user gets three new IMAP folders: spam, spamfairy.spam, and spamfairy.ham. All incoming mail gets procmail filtered through hammiefilter and filed into inbox or spam. When the user sees something that's misfiled, they have to move it to the appropriate spamfairy folder. Every night the Spam Fairy checks under their pillow for new email and trains on it (this is done with hammiefilter). She then moves these messages into either the inbox or the spam folder, depending on which spamfairy folder she's currently visiting. This seems to work pretty well for my two trial users. If it turns out that it's viable, I'd be happy to provide the spamfairy script. (It doesn't run system-wide yet.) Neale
If mboxtrain keeps a list somewhere of the messages it has seen, there's no longer any need to modify the mbox.
by freeing the training materials from the source documents, wouldn't this also create the opportunity for implementing an automated 'freshness' mechanism whereby 'pruning' of said materials (based on age, etc.) would be possible? b
bill parducci <bill@parducci.net> writes:
by freeing the training materials from the source documents, wouldn't this also create the opportunity for implementing an automated 'freshness' mechanism whereby 'pruning' of said materials (based on age, etc.) would be possible?
Quite likely, yes, that would be possible.
participants (7)
- bill parducci
- David McLaughlin
- Neale Pickett
- Skip Montanaro
- Tim Stone - Four Stones Expressions
- Toby Dickenson
- Toby Dickenson