[Spambayes] .db files

Fri Nov 14 02:18:27 EST 2003

[cweisbrod at cogeco.ca]
> I've recently started using SpamBayes and have had great success in
> finally avoiding unsolicited email. The only problem I have with
> SpamBayes is the significant overhead required to get the code
> running. Installing the Python binaries first and then having a
> process that consumes 10-15 MB of runtime memory is less than optimal.

Are you running on a TRS-80 <wink>?

> I am a software engineer with 9+ years of experience and I fully
> understand the need for a project like this to be written in a
> platform-independent manner. However, I would really like to write a
> scaled-down version of this for the Windows platform. I'm quite
> certain that I can write a self-installing/uninstalling service that
> implements a POP/SMTP proxy, performs the statistical processing, and
> exposes user interfaces for training and configuring. I have a
> mathematics and physics degree, so the statistics stuff shouldn't be
> too difficult to implement. The entire implementation would be
> contained in a single executable file much less than 1 MB in size (I
> don't use MFC).

Neither do we, at least not in this project.

> Of course this executable would have to create its own config files,
> spambayes.messageinfo.db, hammie.db, etc. That's where I need some
> information.

Others have addressed that, and it's complicated.  I'll warn you in advance
that the memory and processing power don't go into the statistics, though:
the memory mainly goes into caching a potentially huge number of token
statistics records in memory to make it bearably fast.  If you dig up each
token from an on-disk database each time you see it, it probably won't run
fast enough for high-volume users to tolerate.  You might overcome that by
writing your own database implementation, specialized to the particular data
formats this thing needs to access, one not burdened (as is the Sleepycat
DB) by needing to pay to be all things to all people.
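
To make the caching point concrete, here is a minimal sketch of a dict
cache sitting in front of an on-disk token table.  It is not the SpamBayes
storage code: the class name, the table layout, and the use of SQLite as
the backing store are all assumptions made for illustration.

```python
import sqlite3


class CachedTokenStore:
    """Hypothetical store: token -> (spam_count, ham_count).

    A plain dict caches every record that has been touched, so a token
    seen repeatedly costs only one lookup in the backing database.
    """

    def __init__(self, path=":memory:"):
        # SQLite stands in for the on-disk database (an assumption).
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tokens "
            "(token TEXT PRIMARY KEY, spam INTEGER, ham INTEGER)"
        )
        self.cache = {}
        self.disk_reads = 0  # counts lookups that missed the cache

    def get(self, token):
        try:
            return self.cache[token]
        except KeyError:
            pass
        # Cache miss: go to the database once, then remember the answer.
        self.disk_reads += 1
        row = self.db.execute(
            "SELECT spam, ham FROM tokens WHERE token = ?", (token,)
        ).fetchone()
        counts = row if row is not None else (0, 0)
        self.cache[token] = counts
        return counts

    def add(self, token, is_spam):
        # Write-through: update the cache and the database together.
        spam, ham = self.get(token)
        counts = (spam + 1, ham) if is_spam else (spam, ham + 1)
        self.cache[token] = counts
        self.db.execute(
            "INSERT OR REPLACE INTO tokens VALUES (?, ?, ?)",
            (token, counts[0], counts[1]),
        )

    def flush(self):
        # Make pending writes durable when the store is file-backed.
        self.db.commit()
```

The cache here grows without bound, which is exactly where the memory
goes; a bounded cache would trade some of that memory for more database
lookups.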

You should also know that the actual scoring is a tiny part of the code, and
quite straightforward.  Parsing email, tokenization, dealing with a
gazillion incompatible email systems, and putting up a user-friendly UI are
each harder than the mathematical part, and the code for each dwarfs the
total code for all the computational parts.

BTW, despite the use of a memory cache on top of the Sleepycat DB, the
classifier runs several times faster still if I eliminate Sleepycat entirely
and run the entire database in memory (as one large Python dictionary).  For
a large (but not unreasonably large) token database, that can easily suck up
50MB of RAM just for the dict.
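
To get a feel for what an all-in-memory dictionary costs, a rough sketch
like the following measures a synthetic one.  The token names and the
(spam_count, ham_count) record shape are made up for illustration, and the
numbers depend on the Python version and platform, so treat the result as
an order-of-magnitude estimate only.

```python
import tracemalloc


def dict_memory_bytes(n_tokens):
    """Build a synthetic token dict; return (bytes allocated, entries)."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    # Hypothetical record layout: token -> (spam_count, ham_count).
    db = {"token%d" % i: (i, i + 1) for i in range(n_tokens)}
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, len(db)


if __name__ == "__main__":
    used, entries = dict_memory_bytes(100000)
    print("%d entries, about %.1f MB" % (entries, used / 1e6))
```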

For perspective, I happened to be window-shopping for a new home computer
this week, and even entry-level systems come with 128MB of RAM these days;
I'll probably get at least 4x that much.  50MB is about 10% of that -- it
won't even be worth thinking about idly for me anymore.

Not that there's anything wrong with lean & mean.  It just went out of style
around the time you started your career <wink>.