[spambayes-bugs] [ spambayes-Bugs-826954 ] Database corruption after multiple trainings

SourceForge.net noreply at sourceforge.net
Mon Oct 20 11:59:50 EDT 2003


Bugs item #826954, was opened at 2003-10-20 10:59
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=826954&group_id=61702

Category: imapfilter
Group: Source code 1.0a6
Status: Open
Resolution: None
Priority: 5
Submitted By: Jacob Farmer (jpfarmer)
Assigned to: Tony Meyer (anadelonbrin)
Summary: Database corruption after multiple trainings

Initial Comment:
I've made a few posts about this to the Spambayes 
mailing list and I'm going to paste those messages for 
reference.  However, in summary: regardless of what 
data format I use, after several trainings using the -t 
flag, my database becomes corrupted.  I've been able to 
reproduce this using both a Pickle and Bsddb[3].  Each 
time, if I remove the DB and retrain from scratch, there 
isn't a problem.  Also, if I just classify, I don't have any 
corruption problems (that is, if I just train once and 
after that never train again).  The training always 
completes, and when it moves onto classifying, I get an 
assertion error.

The messages from the mailing list follow.  Below the 
message, I've included a sample sb_imapfilter session 
transcript.

----------------------------------------
Messages from mailing list:
----------------------------------------
From  "Tony Meyer" <tameyer at xxxx.xx.xx> 
Subject  RE: [Spambayes] Serious Database Corruption Problems 
Date  Wed, October 15, 2003 6:18 pm 
To  leotune at xxxxxxxxxxxxxxxxxx.xxx,spambayes at xxxxxx.xxx 

--------------------------------------------------------------------------------


> I'm having a lot of trouble with what I think is database corruption.
> I've included the output I get from the program before, but from what
> I've read, an assertion error usually means the database is dead.

Yes - what this is saying is that you have a token that 
has appeared in more
spam than you have trained, which is obviously 
impossible.
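
To make the invariant concrete, here is a minimal sketch of the bookkeeping behind "assert spamcount <= nspam". It is a toy model, not SpamBayes code: the class ToyClassifier and its methods are hypothetical names, and bump_tokens_only only simulates one way the counts could drift apart. It does not claim to be the actual cause of this bug.

```python
# Hypothetical toy model, not SpamBayes code. It mirrors only the
# invariant that a token cannot appear in more spam messages than
# the total number of spam messages trained.

class ToyClassifier:
    def __init__(self):
        self.nspam = 0        # number of spam messages trained
        self.spamcounts = {}  # token -> number of spam messages containing it

    def learn_spam(self, tokens):
        # Correct training: the message count and token counts move together.
        self.nspam += 1
        for tok in set(tokens):
            self.spamcounts[tok] = self.spamcounts.get(tok, 0) + 1

    def bump_tokens_only(self, tokens):
        # Simulated bookkeeping error: token counts rise, nspam does not.
        for tok in set(tokens):
            self.spamcounts[tok] = self.spamcounts.get(tok, 0) + 1

    def violations(self):
        # Tokens for which the assertion would fail.
        return sorted(t for t, c in self.spamcounts.items() if c > self.nspam)


clf = ToyClassifier()
clf.learn_spam(["viagra", "free"])
ok_before = clf.violations()      # counts are consistent here
clf.bump_tokens_only(["viagra"])
bad_after = clf.violations()      # "viagra" now exceeds nspam
```

Any sequence of operations that raises a token count without raising the message count (or lowers the message count without lowering the token counts) would leave the database in the state the assertion rejects.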

> As the FAQ suggests, I've tried both Bsddb[3] and Pickle formats, but
> after a few trainings, I always get this error.  If I delete my
> databases and start over, then I'm fine for a few additional trainings,
> but the same thing happens.

It's very strange that this happens with a pickle.  To 
me, that sounds like
this is an imapfilter bug, although not one I've seen 
reported before.

> I'm getting a little frustrated with this.  Is there 
> something I can do to keep this from happening?

Do you do all your training with "sb_imapfilter.py -t"?  Up 
until the
assertion error, does the training always successfully 
complete?  (i.e. it
doesn't crash halfway through?)

If you run db_expimp.py on your database to convert it 
to text
("db_expimp.py -e -d hammie.db -f hammie.txt" if it's a 
pickle) and open it
up, what are the ham and spam counts at the top?  (I 
suspect 0 for both).

=Tony Meyer
--------------------------------------------------
From  jacob-spambayes-list at xxxxxxxxxxxxxxxxxx.xxx 
Subject  RE: [Spambayes] Serious Database Corruption Problems 
Date  Wed, October 15, 2003 10:56 pm 
To  spambayes at xxxxxx.xxx 

--------------------------------------------------------------------------------


>> I'm getting a little frustrated with this.  Is there
>> something I can do to keep this from happening?
>
> Do you do all your training with "sb_imapfilter.py -t"?  Up until the
> assertion error, does the training always successfully complete?  (i.e.
> it doesn't crash halfway through?)

Yes, I do all of my training that way.  The training 
always completes, and
then the program fails during classification.  I've 
included a typical
transcript below.  Something worth making note of: it 
seems like, many
times during training, it'll report that messages are 
trained when there
are no new messages in that particular folder.

>
> If you run db_expimp.py on your database to convert it to text
> ("db_expimp.py -e -d hammie.db -f hammie.txt" if it's a pickle) and open
> it up, what are the ham and spam counts at the top?  (I suspect 0 for
> both).

suslik% more hammie.txt
311,431,

I can send you the whole file if it'd be useful.

Thanks,
Jacob
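
For anyone wanting to check an exported database for the tokens that trip the assertion, the following is a rough sketch. It assumes the export is comma-separated, with a first line holding the ham and spam totals (as in the "311,431," line above) and each later line holding a token, its ham count, and its spam count. That per-token layout is an assumption and should be checked against the real file. The function name find_impossible_tokens is hypothetical, and the token rows in the sample are invented for illustration.

```python
import csv
import io


def find_impossible_tokens(lines):
    """Return (nham, nspam, offenders) for an exported database.

    Assumed layout (not verified against db_expimp.py):
      line 1:      nham,nspam,
      other lines: token,hamcount,spamcount,
    """
    reader = csv.reader(lines)
    header = next(reader)
    nham, nspam = int(header[0]), int(header[1])
    offenders = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        token, hamcount, spamcount = row[0], int(row[1]), int(row[2])
        # A token seen in more messages than were trained is impossible.
        if hamcount > nham or spamcount > nspam:
            offenders.append((token, hamcount, spamcount))
    return nham, nspam, offenders


# Header taken from the report; the token rows are made up.
sample = io.StringIO(
    "311,431,\n"
    "free,2,430,\n"
    "subject:hello,312,0,\n"
    "viagra,0,432,\n"
)
nham, nspam, offenders = find_impossible_tokens(sample)
```

Against a real export the same function could be fed an open file, for example `open("hammie.txt", newline="")`. An empty offenders list would suggest the stored counts are consistent and the problem lies elsewhere.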

----------------------------------------
A sample sb_imapfilter transcript:
----------------------------------------
Something worth noting about the following transcript: 
for the lines that look like these, the contents of those 
two folders never changed, so I don't understand why 
they indicated messages were trained.  It doesn't do 
this with the Inbox.

   Training ham folder INBOX
.*............       1 trained.
   Training spam folder INBOX.-Spam
*..........................................................................
............................................................................
............................................................................
............................................................................
............................................................................
....................................................
      1 trained.
----------------------------------------

suslik% ./sb_imapfilter.py -l 5 -c -t -v -d hammie.db
SpamBayes IMAP Filter Beta1, version 0.1 (September 2003),
using SpamBayes IMAP Filter Web Interface Alpha2, version 0.02
and engine SpamBayes Beta2, version 0.2 (July 2003).

Loading state from hammie.db pickle
hammie.db is an existing pickle, with 310 ham and 417 spam
Loading database hammie.db... Done.
Training
   Training ham folder INBOX.-Wanted
............................................................................
............................................................................
............................................................................
.....................................................................
      0 trained.
   Training ham folder INBOX
.*............       1 trained.
   Training spam folder INBOX.-Spam
*..........................................................................
............................................................................
............................................................................
............................................................................
............................................................................
......................................**************
      15 trained.
Persisting hammie.db as a pickle
Training took 35.0596210957 seconds, 16 messages were trained
Classifying
...................
Classified 0 ham, 0 spam, and 0 unsure.
Classifying took 0.656105995178 seconds.
Training
   Training ham folder INBOX.-Wanted
............................................................................
............................................................................
............................................................................
.....................................................................
      0 trained.
   Training ham folder INBOX
.*............       1 trained.
   Training spam folder INBOX.-Spam
*..........................................................................
............................................................................
............................................................................
............................................................................
............................................................................
....................................................
      1 trained.
Persisting hammie.db as a pickle
Training took 29.7854119539 seconds, 2 messages were trained
Classifying
..................*.Traceback (most recent call last):
  File "./sb_imapfilter.py", line 824, in ?
    run()
  File "./sb_imapfilter.py", line 814, in run
    imap_filter.Filter()
  File "./sb_imapfilter.py", line 675, in Filter
    self.unsure_folder)
  File "./sb_imapfilter.py", line 594, in Filter
    evidence=True)
  File "/u/jpfarmer/lib/python2.3/site-packages/spambayes/classifier.py", line 158, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/u/jpfarmer/lib/python2.3/site-packages/spambayes/classifier.py", line 395, in _getclues
    prob = self.probability(record)
  File "/u/jpfarmer/lib/python2.3/site-packages/spambayes/classifier.py", line 245, in probability
    assert spamcount <= nspam
AssertionError

