[spambayes-dev] A new and altogether different bsddb breakage

Tim Peters tim.one at comcast.net
Sun Dec 14 00:05:02 EST 2003

[Richie Hindle]
> In response to Skip's question about hapax ratios, I ran his script
> and received an error.  I boiled the problem down to this:
> >>> print [db[k] for k in db]
> Traceback (most recent call last):
>   File "hapaxes.py", line 3, in ?
>     print [db[k] for k in db]
>   File "C:\Python23\lib\shelve.py", line 118, in __getitem__
>     f = StringIO(self.dict[key])
>   File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
>     return self.db[key]
> KeyError: 'pics'
> Excuse me?  Er, so how many of these things are there?
> >>> len([k for k in db if db.get(k, None) is None])
> 306

Ouch.  What do you get if you open the database directly, instead of
indirecting thru a shelf?  I'm just trying to make sure it's really the
database that's hosed.  For example, here's a complete program picking on my
database:

PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db"
import bsddb
d = bsddb.hashopen(PATH, 'r')
print len(d)
print len([k for k in d if d.get(k, None) is None])

That printed 40787, then 0, when I ran it just now.
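
The same check can be sketched against any mapping-like object, not just a
bsddb handle.  This is only an illustration: the names unreachable_keys and
BrokenMapping are made up here, the toy class merely imitates the symptom
(iteration yielding keys that direct lookup rejects) and says nothing about
what bsddb does internally, and the code assumes a current Python 3 rather
than the 2.3 used above.

```python
def unreachable_keys(mapping):
    """Return the keys that iteration yields but direct lookup rejects."""
    bad = []
    for key in mapping:
        try:
            mapping[key]
        except KeyError:
            # Iteration claimed this key exists; lookup disagrees.
            bad.append(key)
    return bad


class BrokenMapping(dict):
    """Hypothetical stand-in for a damaged database (illustration only)."""

    def __init__(self, data, ghosts):
        super().__init__(data)
        self._ghosts = list(ghosts)

    def __iter__(self):
        # Yield the real keys first, then keys that have no stored value.
        yield from super().__iter__()
        yield from self._ghosts


if __name__ == "__main__":
    db = BrokenMapping({"ham": "1", "spam": "2"}, ["pics", "magnetism"])
    print(len(unreachable_keys(db)))
```

On a healthy mapping the function returns an empty list.  Catching KeyError
rather than comparing get() against None also keeps a legitimately stored
None from being counted as a missing key.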

> And what do they look like?

Doesn't matter -- it should never happen!

> >>> from pprint import pprint as p
> >>> p([k for i, k in enumerate(db) if db.get(k, None) is None and i % 50 == 0])
> ['magnetism',
>  'url:mlqnuvs',
>  'from:addr:wi872u',
>  'autograph.',
>  'url:ff-programs',
>  'motels,']
> So they have nothing obvious in common.  Looking through the full list
> it's obvious that they don't all come from one message.  Some are
> obviously ham clues and some are obviously spam.
> I'm probably winging my way towards a DBRunRecovery error, unless
> someone can explain what's going on?

I've fixed miserable *similar* bugs in ZODB's BTrees (enumerating finds keys
that direct lookup doesn't believe exist), so I'm not shocked if some other
database screws up in this way too.  Gotta say, I'm half ready to declare
that ZODB is the only database anyone should ever use (the bugs in that are
long fixed <wink>).
