[Spambayes] Shelve database corruption?

Lars Marius Garshol larsga at ontopia.net
Wed May 19 10:47:32 EDT 2004


* Lars Marius Garshol
|
|     f = StringIO(self.dict[key])
| error: (-30981, 'Unknown error 4294936315')
| 
| What I'm looking for is some idea of what's wrong with the shelve
| database so that I can fix the corruption. 

* Skip Montanaro
| 
| I've no idea.  If the underlying file is a Berkeley DB thing of some
| sort, 

I've just installed and used the POP3 proxy on Linux without doing
anything special at all. This is what I've been able to find out:

[root at pavarotti spambayes-1.0rc1]# file hammie.db
hammie.db: Berkeley DB (Hash, version 7, native byte-order)
[root at pavarotti spambayes-1.0rc1]# python utilities/which_database.py
Pickle is available.
Dumbdbm is available.
Dbhash is available.
Bsddb[3] is available.
 
Your storage /usr/home/larsga/tmp/spambayes-1.0rc1/hammie.db is a: dbhash
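
For reference, the standard library can make a similar guess about a
database file's format: the whichdb module in Python 2, dbm.whichdb in
Python 3. A minimal sketch in Python 3 syntax follows. identify_storage
is a made-up helper name, and the throwaway dumbdbm-style file exists
only so the example has something to inspect. This is not necessarily
how which_database.py does it.

```python
import dbm
import dbm.dumb
import os
import tempfile


def identify_storage(path):
    """Hypothetical helper: describe what dbm.whichdb() makes of path."""
    kind = dbm.whichdb(path)
    if kind is None:
        # The file does not exist or cannot be read.
        return "missing or unreadable"
    if kind == "":
        # The file is readable but its format could not be guessed.
        return "unrecognized format"
    return kind


# Build a small throwaway database so there is something to identify.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "sample")
db = dbm.dumb.open(sample, "c")
db[b"token"] = b"1"
db.close()

found = identify_storage(sample)
absent = identify_storage(os.path.join(workdir, "nothing-here"))
print(found)
print(absent)
```

Note that a dumbdbm-style database is not a single file: it is stored
under the given name with suffixes that include .dat and .dir.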

| you might be able to restore it to good health using the db_recover
| command (which you may or may not have on-hand).

I've got db_recover, and ran it to find out what it does, but it
doesn't appear to do anything. I found the documentation for it, which
talks about log files, but I don't appear to have any. Not sure
whether this means I don't have the right sort of Berkeley DB or
whether it means that the log files are somewhere else.

| Whether or not that succeeds you might try running the
| sb_dbexpimp.py command to dump the database and reload it into a new
| file.

That seemed to work. I managed to export hammie.db to a CSV file, but
for whatever reason I can't import it to a new hammie.db:

[root at pavarotti spambayes-1.0rc1]# sb_dbexpimp.py -i -d hammie.db -f hammie.db.export
Traceback (most recent call last):
  File "/usr/bin/sb_dbexpimp.py", line 266, in ?
    runImport(dbFN, useDBM, newDBM, flatFN)
  File "/usr/bin/sb_dbexpimp.py", line 183, in runImport
    (nham, nspam) = rdr.next()
  File "/usr/lib/python2.2/site-packages/spambayes/compatcsv.py", line 23, in next
    return self.parse_line(self.fp.next())
AttributeError: 'file' object has no attribute 'next'

From the traceback this seems like it can't read the CSV file (which
looks fine in "less"), but that seems really bizarre. Any ideas?

I'm using Python 2.2.2; could this be a compatibility thing?
I see that the code uses compatcsv.py, creates a "reader" object,
passing it a file handle (as returned by "open"), but the reader then
calls "next" on it, which it obviously does not support.
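
One way around a file object without next() is a thin wrapper that
builds next() on top of readline(). The sketch below is only an
illustration in Python 3 syntax: LineReader is a made-up name, not part
of compatcsv.py, and the sample data is invented. The one real
difference between the two calls is at end of file, where readline()
returns an empty string instead of raising StopIteration, so the wrapper
has to translate that.

```python
import io


class LineReader:
    """Hypothetical shim: give any object with readline() a next()."""

    def __init__(self, fp):
        self.fp = fp

    def __iter__(self):
        return self

    def __next__(self):
        line = self.fp.readline()
        # readline() signals end of file with an empty string; a blank
        # line in the file still comes back as "\n", so this is safe.
        if line == "":
            raise StopIteration
        return line

    # Python 2 spells the iterator method next() rather than __next__().
    next = __next__


# Invented sample data standing in for an exported file.
reader = LineReader(io.StringIO("1,2\nfoo,3,4\n"))
lines = list(reader)
print(lines)
```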

As far as I can tell, in Python 2.2.2 file objects don't have a next()
method. It would appear, though, that readline() would do the same
thing. Changing that made it run a bit further, but I still got
problems:

[root at pavarotti spambayes-1.0rc1]# sb_dbexpimp.py -i -d hammie.db -f hammie.db.export
parse error: 2
Traceback (most recent call last):
  File "/usr/bin/sb_dbexpimp.py", line 266, in ?
    runImport(dbFN, useDBM, newDBM, flatFN)
  File "/usr/bin/sb_dbexpimp.py", line 183, in runImport
    (nham, nspam) = rdr.next()
  File "/usr/lib/python2.2/site-packages/spambayes/compatcsv.py", line 23, in next
    return self.parse_line(self.fp.readline())
  File "/usr/bin/sb_dbexpimp.py", line 170, in runImport
    os.unlink(dbFN+".dir")
OSError: [Errno 2] No such file or directory: 'hammie.db.dir'

I've no idea why it would want to delete this file. As far as I know
it's never existed. What's even more interesting is that there's a
try/except block catching OSError here, so we shouldn't be seeing this
traceback at all. 

Widening the except clause so it catches all exceptions has no effect.
Removing the whole try/except plus the unlink call makes the block
above it, which deletes the .dat file, throw an OSError!?! Removing all
three unlink calls gives an ImportError on the csv module. So something
appears to be rethrowing the last exception that was caught, whichever
one that happens to be, rather than the true cause of the problem.
It seems like it might be line 59 in compatcsv.py that does this.

The real problem seems to be hammie.db.export itself: sb_dbexpimp.py
apparently produces CSV files that it can't read back in again. I
discovered that line 66 of compatcsv.py has a bug that means it can
never have worked:

                line = line[len(field)+len(match.group(2))]

should be:

                line = line[len(field)+len(match.group(2)):]
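
To illustrate the difference between the two lines: indexing returns a
single character, while slicing returns the rest of the string. This is
a sketch with made-up sample data, not the actual input compatcsv.py
sees, and "separator" stands in for whatever match.group(2) holds.

```python
# Hypothetical sample data; not taken from compatcsv.py.
line = "spam,ham,eggs"
field = "spam"
separator = ","

consumed = len(field) + len(separator)

# Indexing picks out exactly one character at that position.
buggy = line[consumed]

# Slicing keeps everything from that position to the end of the line,
# which is what a parser needs in order to continue with the next field.
fixed = line[consumed:]

print(repr(buggy))
print(repr(fixed))
```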

Then I got into trouble with compatcsv.py assuming the file was UTF-8,
and I haven't been able to fiddle more with it. It does look like this
code doesn't run on Python 2.2 at all. I'll have to consider
installing 2.3 or spending more time on fixing it.

* Lars Marius Garshol
|
| I've trained SpamBayes on 37,000 emails, so the idea of starting
| again from scratch is not appealing...

* Skip Montanaro
|
| Which seems like way too much to me, but maybe that's just me.  If
| you're having trouble properly classifying messages with that
| database (ignoring the corruption issue) it's likely there are some
| mistakes or at least questionable classifications in there which are
| contributing to confusion on SpamBayes' part.

Classification was working perfectly before the corruption, actually.
I get several hundred spam every day, but nothing comes through,
except maybe 10-15 unsure emails every week. As far as I can tell
there are no false positives, and I've checked pretty carefully.

So if only I can solve this corruption problem I'll be a very happy
user again.

Anyway, thanks a lot for the help so far. I feel I'm at least one step
closer now.

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50                  <URL: http://www.garshol.priv.no >



