I had performed the initial training for the utility, and had around 1100 spam examples and 400+ ham examples. Outlook hung, and I had to abnormally terminate it and reboot in order for the computer to function properly. When I opened Outlook again, SpamBayes indicated there were no spam and no ham instances in its database. Unfortunately, I had already deleted my spam examples, and now don't have them for re-training.
Is there a way to import that information from a backup file or anything?
I am running XP, Outlook XP, and the binary version of the Outlook plugin. My Python version is ActiveState 2.2.4.
Thanks in advance,
Greg
Given that you know approximately how many of each you had, it's relatively simple to correct. Run dbExpImp.py with no operands to see how to export and import the classifier database. Then export it.
Bring the export file up in an editor. The first line will have two numbers in it, both zero. Those numbers are the number of spam and the number of ham, respectively, that are in the database. Change them to 1100 and 400 (or whatever is appropriate), save the file, and then import it. Just for good measure, you might import to a different database than the original. This should correct the problem.
4/7/2003 8:00:59 AM, "Greg Scott" <gregscott@gbsage.com> wrote:
I had performed the initial training for the utility, and had around 1100 spam examples and 400+ ham examples. Outlook hung, and I had to abnormally terminate it and reboot in order for the computer to function properly. When I opened Outlook again, SpamBayes indicated there were no spam and no ham instances in its database. Unfortunately, I had already deleted my spam examples, and now don't have them for re-training.
Is there a way to import that information from a backup file or anything?
I am running XP, Outlook XP, and the binary version of the Outlook plugin. My Python version is ActiveState 2.2.4.
Thanks in advance,
Greg _______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes
c'est moi - TimS http://www.fourstonesExpressions.com http://wecanstopspam.org There are 10 kinds of people in the world: those who understand binary, and those who don't.
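The hand edit described above (export, change the two counts on the first line, import) can also be scripted. The following is only a sketch: the function name is hypothetical, the spam-then-ham order is taken from the message above and is worth checking against the usage text that dbExpImp.py prints, and the code deliberately assumes nothing about the separator between the two numbers.

```python
import re


def set_message_counts(export_text, nspam, nham):
    """Return export_text with the two counts on its first line replaced.

    Assumption: the first line holds exactly two integers, spam count
    first and ham count second, as described in the message above.
    Whatever separates them is left untouched.
    """
    first, newline, rest = export_text.partition("\n")
    if len(re.findall(r"\d+", first)) != 2:
        raise ValueError("expected exactly two numbers on the first line")
    replacements = iter((str(nspam), str(nham)))
    # Replace the numbers in order of appearance, keeping the delimiters.
    new_first = re.sub(r"\d+", lambda match: next(replacements), first)
    return new_first + newline + rest
```

Read the export file into a string, pass it through this function, write the result to a new file, and import that file, ideally into a fresh database as suggested above.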
I've added a bug for the fact that the addin doesn't save the database after a train operation.
http://sourceforge.net/tracker/index.php?func=detail&aid=717253&group_id=61702&atid=498103
For the spambayes crowd: Doing this for a "pickle" database would be prohibitive. Now that there is a bsddb3 that works for Windows, how would people feel about me dropping all pickle support from the plugin? This will require you to install bsddb3 (or Python 2.3) and do a full re-train. As I have mentioned before, I would also be happy to accept patches that do an automatic migration ;)
Any objections?
Mark.
-----Original Message-----
From: spambayes-bounces@python.org [mailto:spambayes-bounces@python.org] On Behalf Of Greg Scott
Sent: Monday, 7 April 2003 11:01 PM
To: spambayes@python.org
Subject: [Spambayes] Lost database
I had performed the initial training for the utility, and had around 1100 spam examples and 400+ ham examples. Outlook hung, and I had to abnormally terminate it and reboot in order for the computer to function properly. When I opened Outlook again, SpamBayes indicated there were no spam and no ham instances in its database. Unfortunately, I had already deleted my spam examples, and now don't have them for re-training.
Is there a way to import that information from a backup file or anything?
I am running XP, Outlook XP, and the binary version of the Outlook plugin. My Python version is ActiveState 2.2.4.
Thanks in advance,
Greg
[Mark Hammond]
I've added a bug for the fact that the addin doesn't save the database after a train operation.
http://sourceforge.net/tracker/index.php?func=detail&aid=717253&group_id=61702&atid=498103
For the spambayes crowd: Doing this for a "pickle" database would be prohibitive.
Not for my pickle databases: they're under 2MB, and I rarely train on anything anymore.
Now that there is a bsddb3 that works for Windows, how would people feel about me dropping all pickle support from the plugin? This will require you to install bsddb3 (or Python 2.3) and do a full re-train. As I have mentioned before, I would also be happy to accept patches that do an automatic migration ;)
Any objections?
Me! I'd like to hold off on that until Python 2.3 final is released, as I don't want to encourage people to install an alpha Python (which 2.3 still is; only Python (not spambayes) alpha testers should be using anything later than Python 2.2.2).
For the spambayes crowd: Doing this for a "pickle" database would be prohibitive.
Not for my pickle databases: they're under 2MB, and I rarely train on anything anymore.
But this would still mean that every unsure you bothered to hit the "recover" or "delete" button on would require writing the 2MB pickle. I believe we can afford this hit with a bsddb style database.
Me! I'd like to hold off on that until Python 2.3 final is released, as I don't want to encourage people to install an alpha Python (which 2.3 still is; only Python (not spambayes) alpha testers should be using anything later than Python 2.2.2).
And requiring bsddb3 to be installed is too much of a burden? But yeah, either way, having the code not flush-on-train for pickles isn't that big of a deal. Mark.
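The cost difference under discussion can be sketched as follows. This is an illustration only, not the addin's code: the function names are hypothetical, and the standard-library dbm.dumb module stands in for bsddb3 so that the example runs without extra installs. With a pickle, every flush rewrites the whole database file; with a dbm-style store, only the changed record is written.

```python
import dbm.dumb
import pickle


def train_with_pickle(path, counts, word):
    """Bump one word count, then flush by rewriting the entire pickle."""
    counts[word] = counts.get(word, 0) + 1
    with open(path, "wb") as f:
        # The whole dictionary is serialized again, however small the change.
        pickle.dump(counts, f)
    return counts[word]


def train_with_dbm(path, word):
    """Bump one word count in a dbm-style file; only that record is written."""
    with dbm.dumb.open(path, "c") as db:
        key = word.encode("utf-8")
        count = int(db.get(key, b"0").decode("ascii")) + 1
        db[key] = str(count).encode("ascii")
    return count
```

For a 2MB pickle, the first function writes about 2MB per training click, which is the hit described above.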
[Mark Hammond]
... And requiring bsddb3 to be installed is too much of a burden?
Installing it is more bother than not installing it. "Too much" will vary by user. I'm typing on a laptop now with a dialup connection, and almost out of disk space -- it seems like a lot of bother for me on this box right now <wink>, and the pickle-based addin works fine here. I expect bsddb3 will chew up more disk space too (relative to pickles -- have a feel for that? I gave up using Sam Rushing's bsddb 1.85 Windows port years ago due to disk bloat and bugs; I'm told the bugs are fixed in bsddb3, but don't know about disk consumption).
But yeah, either way, having the code not flush-on-train for pickles isn't that big of a deal.
If you feel more strongly about it than I do, go ahead. If the intent is to move to bsddb3 exclusively, then there's a lot to be said for biting that bullet before many more people grow pickle databases.
If you feel more strongly about it than I do, go ahead. If the intent is to move to bsddb3 exclusively, then there's a lot to be said for biting that bullet before many more people grow pickle databases.
Why not move to the DBAPI and let people make their own choice about db backend? I personally think a great deal of sqlite and hope it makes its way into the core someday (and its licensing supports that!), especially in comparison to Gadfly.
Dave LeBlanc
Seattle, WA USA
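A rough idea of what coding against the DB-API could look like, using the sqlite3 module from today's standard library as the example backend. Everything here is an assumption made for illustration: the table layout, the column names, and the function names are hypothetical and are not taken from spambayes. Note also that DB-API modules differ in parameter style; the "?" placeholders below are what sqlite3 uses.

```python
import sqlite3

# Hypothetical schema: one row per token with its spam and ham counts.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS words ("
    "word TEXT PRIMARY KEY, nspam INTEGER NOT NULL, nham INTEGER NOT NULL)"
)


def open_store(connect=sqlite3.connect, target=":memory:"):
    """Open a word store through any DB-API style connect() callable."""
    conn = connect(target)
    cur = conn.cursor()
    cur.execute(SCHEMA)
    conn.commit()
    return conn


def train(conn, word, is_spam):
    """Record one sighting of word and return its (nspam, nham) counts."""
    cur = conn.cursor()
    cur.execute("SELECT nspam, nham FROM words WHERE word = ?", (word,))
    row = cur.fetchone()
    nspam, nham = row if row is not None else (0, 0)
    if is_spam:
        nspam += 1
    else:
        nham += 1
    if row is None:
        cur.execute(
            "INSERT INTO words (word, nspam, nham) VALUES (?, ?, ?)",
            (word, nspam, nham),
        )
    else:
        cur.execute(
            "UPDATE words SET nspam = ?, nham = ? WHERE word = ?",
            (nspam, nham, word),
        )
    conn.commit()
    return nspam, nham
```

Swapping backends would then mostly mean passing a different connect callable, plus adjusting the placeholders where the module's paramstyle differs.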
[Tim1]
If you feel more strongly about it than I do, go ahead.
I don't, but:
If the intent is to move to bsddb3 exclusively, then there's a lot to be said for biting that bullet before many more people grow pickle databases.
is exactly where I was coming from! (FYI, the binaries are all bsddb based, so real <wink> people won't be growing pickles.)
So, what I have decided is that I will state publicly, and document somewhere, that pickles will not be supported long term by Outlook. I will keep the code so long as the cost is small. Next time an incompatible database change happens, drop support. Such a change will ideally involve an automatic "upgrade" from the existing db - but not from existing pickles, so at this time I would declare pickles dead. Hopefully this will be post Python 2.3.
Where I am coming from with the "incompatible database" is my idea for the "message database" next to our "word database", as posted here a couple of months back. I made a start on it, then decided I was being too ambitious and changing too much, so I abandoned it, intending to go back to my original "slightly hacky but less intrusive" plan. Tim2 may recall that this database is what is preventing dbExport from working, rather than the bayes word database. Since then paid work has got in the way. Damn-capitalists <wink>
Mark.
4/8/2003 6:20:31 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:
Where I am coming from with the "incompatible database" is my idea for the "message database" next to our "word database", as posted here a couple of months back. I made a start on it, then decided I was being too ambitious and changing too much, so I abandoned it, intending to go back to my original "slightly hacky but less intrusive" plan. Tim2 may recall that this database is what is preventing dbExport from working, rather than the bayes word database. Since then paid work has got in the way.
I am currently hard at work on the "message database." It is being shaken out with the imap filter, and then I'll incorporate it into the notes filter and the pop3proxy. By then it should be fairly solid <wink>. You might want to have a look at it before then and let me know what you think. It's spambayes.message
c'est moi - TimS
I was home sick today, and took the opportunity to look at a small collection of surprising Unsures and exception-raising msgs I've put to the side since last December. Did some checkins, and they all score as solid spam now.
The exception-raising msgs were ill-formed MIME that lacked a trailing boundary marker. The email pkg is happy enough with this provided they have at least a trailing blank line, but these didn't even have that much. I wormed around it in the Outlook client only, by catching the distinctive exception and feeding the string back into email.message_from_string() after tacking an empty line onto the end. It would be better if we had a common wrapper around email.message_from_string() so that all clients could benefit from these little hacks.
The other was a systematic problem with the way non-comment HTML tags got stripped. Here's the checkin msg:
"""
I dug into a small collection of Unsures that looked like blatant spam, and discovered they were all using this kind of trick:
    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
That is, disguising words by inserting HTML nonsense tags. We replaced each tag with a blank, yielding the pretty useless tokens "Wr", "inkle", "Reduc" and "tion". We previously fixed a similar problem using embedded HTML comments. I should have fixed this other one then. Cute: these all scored at the high end of my Unsure range anyway. Now they're all solidly spam.
"""
That change was to tokenizer.py, and should benefit everyone. I recommend doing a retrain-from-scratch after you update the code, both to purge the useless word fragments that may have accumulated in your database, and to get the actual whole words into it.
Another new checkin just now taught the tokenizer how to decode numeric character entities; here's the checkin msg:
"""
Digging into a pile of high-scoring Unsures showed this trick:
    your se<!XE>ptic system
as a way to disguise "your septic system". Bite the bullet and decode numeric character entities. Also replace <p> and <br> tags with single blanks, since browsers break text visually when they see one of these.
"""
I found this common in "septic tank", "Russian women want to marry you", and "accept credit cards" spam. Against my database, these were scoring as low spam or high unsure (at spam_cutoff 0.8). Most score as high spam now, and without training on them.
although-nothing-makes-the-system-faster<wink>-ly y'rs - tim
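A minimal sketch of the two steps named in that checkin message, decoding numeric character entities and turning <p> and <br> tags into single blanks. The function names and regular expressions are assumptions made for illustration and are not the tokenizer.py code; the "se&#112;tic" input in the usage note is likewise an invented example, not one quoted from the spam.

```python
import re

# Decimal (&#112;) or hexadecimal (&#x70;) numeric character references.
_NUMERIC_ENTITY_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")
# <p> and <br> in any case, with optional attributes or a trailing slash.
_BREAK_TAG_RE = re.compile(r"<\s*(?:p|br)\b[^>]*>", re.IGNORECASE)


def decode_numeric_entities(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        try:
            if hex_digits:
                return chr(int(hex_digits, 16))
            return chr(int(dec_digits))
        except (ValueError, OverflowError):
            # Not a valid code point: leave the text as it was.
            return match.group(0)
    return _NUMERIC_ENTITY_RE.sub(replace, text)


def breaks_to_blanks(text):
    """Replace <p> and <br> tags with a single blank each."""
    return _BREAK_TAG_RE.sub(" ", text)
```

For instance, decode_numeric_entities("your se&#112;tic system") gives back "your septic system", since 112 is the code point of "p".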
4/7/2003 9:16:32 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:
Now that there is a bsddb3 that works for Windows, how would people feel about me dropping all pickle support from the plugin? This will require you to install bsddb3 (or Python 2.3) and do a full re-train.
A full retrain can be avoided by exporting and importing using dbExpImp.py, id est:
    dbExpImp -e -d apickledb -f dbexportfile
    dbExpImp -i -D absddbdb -f dbexportfile
c'est moi - TimS
4/7/2003 11:18:39 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:
A full retrain can be avoided by exporting and importing using dbExpImp.py
Oh ya... forgot. I should fix that. Can you send me an Outlook pickle database so I can give it a go?
c'est moi - TimS
On Tue, 8 Apr 2003 12:16:32 +1000, "Mark Hammond" <mhammond@skippinet.com.au> wrote:
Now that there is a bsddb3 that works for Windows, how would people feel about me dropping all pickle support from the plugin? This will require you to install bsddb3 (or Python 2.3) and do a full re-train. As I have mentioned before, I would also be happy to accept patches that do an automatic migration ;)
I'm in favour. I don't usually like the idea of adding an extra dependency, but Outlook shutdown times are so much better with bsddb3 that I don't see why anyone would still want to use a pickle. David (just waiting for my work PC to retrain - you reminded me that I forgot to install bsddb3 when I set up my new PC last week :-)
participants (7)
- David LeBlanc
- David Leftley
- Greg Scott
- Mark Hammond
- Tim Peters
- Tim Peters
- Tim Stone - Four Stones Expressions