[Spambayes] .db files

Kenny Pitt kennypitt at hotmail.com
Thu Nov 13 16:27:19 EST 2003


cweisbrod at cogeco.ca wrote:
> Hi there,
> 
> I've recently started using SpamBayes and have had great success in
> finally avoiding unsolicited email. The only problem I have with
> SpamBayes is the significant overhead required to get the code
> running. Installing the Python binaries first and then having a
> process that consumes 10-15 MB of runtime memory is less than optimal.
> 
> I am a software engineer with 9+ years of experience and I fully
> understand the need for a project like this to be written in a
> platform-independent manner. However, I would really like to write a
> scaled-down version of this for the Windows platform. I'm quite
> certain that I can write a self-installing/uninstalling service that
> implements a POP/SMTP proxy, performs the statistical processing, and
> exposes user interfaces for training and configuring. I have a
> mathematics and physics degree, so the statistics stuff shouldn't be
> too difficult to implement. The entire implementation would be
> contained in a single executable file much less than 1 MB in size (I
> don't use MFC). Of course this executable would have to create its
> own config files, spambayes.messageinfo.db, hammie.db, etc. That's
> where I need some information. 

If this is the type of implementation you're looking for, you might be
better served by the K9 filter here: http://keir.net/k9.html.  It is not
open source, but it is freeware; the executable is only about 70 KB and
uses very little memory.

> The best implementation would make use of existing
> spambayes.messageinfo.db and hammie.db files so that any existing
> training is not lost. However, I need to know how these files are
> organized. Short of studying the Python source for answers, is there
> any specific documentation on the organization of these files? If
> not, would somebody involved with this project be willing to provide
> me with this information? 

The .db files use the Berkeley DB engine from Sleepycat, and IIRC the
actual data records are stored in a Python-specific format.  Your best
bet would be to use the sb_dbexpimp script to export your training data
into a flat file, which would be easy to parse and convert to any format
you want.
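
As a rough illustration of how little code the parsing step needs, here
is a minimal Python sketch.  It assumes the exported flat file is
comma-separated, with a first row holding the total ham and spam message
counts and every later row holding a token, its ham count, and its spam
count.  That layout is an assumption, so check it against an actual
export before relying on it; the function name parse_export and the
sample data are made up for the example.

```python
import csv
import io


def parse_export(lines):
    """Parse an exported training file into counts.

    ASSUMED layout (verify against a real export):
      row 1:      nham,nspam
      other rows: token,hamcount,spamcount
    """
    reader = csv.reader(lines)
    header = next(reader)
    nham, nspam = int(header[0]), int(header[1])

    tokens = {}
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        word, ham, spam = row
        tokens[word] = (int(ham), int(spam))
    return nham, nspam, tokens


# Made-up sample data standing in for a real export file.
sample = io.StringIO("10,20\nviagra,0,15\nhello,8,2\n")
nham, nspam, tokens = parse_export(sample)
```

For a real file, pass an open file object instead of the StringIO, e.g.
open(path, newline="").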

Be careful if you try to convert the data for a different Bayesian
engine, because SpamBayes adds special tokens based on the structure of
the message that probably wouldn't be understood by a different engine.
K9, for example, also adds special tokens to the training data, but in a
different format and with different meanings.
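
One way to handle this during a conversion is to separate the plain word
tokens from the synthetic ones first.  The Python sketch below uses a
crude heuristic: it treats any token of the form prefix:value as
special.  Both the heuristic and the sample tokens are assumptions for
illustration; ordinary words can contain colons too, so inspect your own
exported data before discarding anything.

```python
def split_tokens(tokens):
    """Split a {token: (ham, spam)} dict into (plain, special).

    Heuristic (an assumption, not a SpamBayes rule): a token counts as
    "special" if it has non-empty text on both sides of its first colon.
    """
    plain, special = {}, {}
    for word, counts in tokens.items():
        prefix, sep, rest = word.partition(":")
        if sep and prefix and rest:
            special[word] = counts
        else:
            plain[word] = counts
    return plain, special


# Made-up sample tokens; "subject:free" stands in for a synthetic token.
example = {"hello": (8, 2), "subject:free": (0, 5), "re:": (3, 1)}
plain, special = split_tokens(example)
```

Only the plain tokens are likely to carry over to another engine as-is;
the special ones would need to be dropped or mapped by hand.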

-- 
Kenny Pitt



