[spambayes-dev] Re: Pickle vs DB inconsistencies

Mon Jul 14 01:55:06 EDT 2003

[Greg Ward]
> ...
> My recent experiences with bugs in the DB[M] storage -- still the
> ostensible subject of this thread! -- are a good argument against
> this. Tim P's words were along the lines of, "The pickle
> implementation is so simple that it's obviously correct",

The original dict implementation was so simple it was obviously correct, but
"obviously" got lost to the layers of indirection added to support database
backends.  Still, classifier.py's default implementations of the wordinfo
accessors remain obviously correct:

    def _wordinfoget(self, word):
        return self.wordinfo.get(word)

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record

    def _wordinfodel(self, word):
        del self.wordinfo[word]

(all of those are used when a dict is used for self.wordinfo, although they
slow it down a lot -- each dict operation turns into a Python function
call).
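
A rough way to see that overhead, as a sketch only: the Classifier class
below is a stand-in holding just the three methods quoted above (it is not
the real classifier.py), the word and counts are made up, and the timings
will vary by machine and Python version.

```python
import timeit


class Classifier:
    """Stand-in with only the three wordinfo methods quoted above."""

    def __init__(self):
        self.wordinfo = {}

    def _wordinfoget(self, word):
        return self.wordinfo.get(word)

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record

    def _wordinfodel(self, word):
        del self.wordinfo[word]


c = Classifier()
c._wordinfoset("viagra", (10, 0))  # made-up (nspam, nham) record

# Same lookup, once straight at the dict and once through the method.
# Both lambdas cost one call each; the wrapped one pays for a second.
direct = timeit.timeit(lambda: c.wordinfo.get("viagra"), number=200000)
wrapped = timeit.timeit(lambda: c._wordinfoget("viagra"), number=200000)

print("direct dict.get : %.3fs" % direct)
print("via _wordinfoget: %.3fs" % wrapped)
```

The extra method call is the price of letting a database backend override
the same three hooks.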

> and it's good to have a gold standard, even if it is a memory hog and
> slow to startup.

Using a dumbdbm backend is more of a memory hog than using a dict.  dumbdbm
internally constructs a dict mapping every key in the database to (offset,
size) pairs of integers as soon as it's opened, where the pairs refer to
positions in the .dat file.  DBDictClassifier then layers more dicts of its
own on top of that.  Using a plain dict instead directly maps every key in
the database to (nspam, nham) pairs of integers as soon as it's opened, but
without additional layers of dicts.  So dumbdbm loses to a plain dict on all
counts (speed, memory usage, disk-file size, robustness, clarity).
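
To make the comparison concrete, here is a small sketch.  It assumes a
modern Python, where dumbdbm lives on as dbm.dumb, and it peeks at the
private _index attribute, which is a CPython implementation detail and not
a supported API.  The words, counts, and the choice of pickle for the
stored values are made up for illustration.

```python
import dbm.dumb
import os
import pickle
import tempfile

# A plain dict maps each word straight to its (nspam, nham) pair.
counts = {"viagra": (10, 0), "python": (0, 25)}

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "wordinfo")

# Write the same records through the dumb dbm backend.
db = dbm.dumb.open(path, "c")
for word, record in counts.items():
    db[word] = pickle.dumps(record)
db.close()

# Reopening loads the whole key index into memory up front.
db = dbm.dumb.open(path, "r")
index = dict(db._index)  # private: key -> (offset, size) into the .dat file
record = pickle.loads(db[b"viagra"])  # a read still has to go to disk
db.close()

print(index)
print(record)
```

Both structures hold every key in memory; the dbm index just points at
where the real data lives instead of holding it.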

I still use a plain dict for my own classifiers, and am very happy with it.
The number of database bugs I've fixed is infinitely larger than the number
I've experienced <0.9 wink>.

BTW, spambayes would be happiest using a ZODB OOBTree for the classifier's
wordinfo structure, and the classifier was designed with that in mind.
Jeremy hooked that up once, using ZEO to share a live database across
multiple remote connections.  The code is in the pspam subdirectory,
although it's probably suffering from bitrot by now.

> ...
> (I'm also not a fan of the .py extension on scripts, for reasons that
> I really can't explain.  I think it's because it reveals an irrelevant
> implementation detail -- the programming language used -- to users of
> the script.  Damn, I guess I should have added an install-time
> distutils option to strip .py from script names.  Unfortunately, my
> time machine isn't as good as Guido's.)

You should leave the .py extension on almost all scripts, just as the Python
distribution does for almost all scripts.  #! and binfmt gimmicks are
OS-specific -- the .py extension is necessary for some OSes to know which
program to use to run a script.  Of course you can leave .py off of scripts
that have no hope of running under anything but Linux.

>> (BTW, why can I import optparse in 2.2.3 when the doc says it only
>> arrives in 2.3?)

> Good question -- I guess someone backported it when I wasn't looking!

No, someone is confused there -- optparse doesn't exist in 2.2.3.  I suspect
they have an overly generous PYTHONPATH setting.
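
A quick way to check which file an import really came from, as a sketch:
module_origin is a hypothetical helper name, and the code is written for a
current Python rather than 2.2.

```python
import importlib
import sys


def module_origin(name):
    """Return the file a module was loaded from, or None for built-ins."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


# If this path is outside the expected installation's lib directory,
# something on PYTHONPATH is supplying the module.
print(module_origin("optparse"))

# The search order that produced that answer.
for entry in sys.path:
    print(entry)
```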