[Spambayes] .db files

Fri Nov 14 12:09:42 EST 2003

[cweisbrod at cogeco.ca]
> Thanks very much for the input and the warnings. I realize I've been
> somewhat naive in my contemplation over the implementation of
> SpamBayes. Particularly didn't realize there is so much
> incompatibility with email systems. Can you give me an example of
> some incompatibilities that you have encountered?

Well, I'm not clear on what you really want to accomplish.  You said

   I would really like to write a scaled-down version of this for the
   Windows platform

The spambayes Outlook addin is overwhelmingly the most popular way to use
spambayes on Windows, so if you want to target that then you have to
reimplement all the code we use to interface with Outlook directly (which is
a much larger pile of code than the classifier proper -- and much harder to
get right (the MS APIs you need to work with are massive and delicate)).

OTOH, if you don't want to replace the Outlook addin, few current
spambayes-on-Windows users will care (the Outlook addin is much easier to
live with than any web-based interface, because the former integrates
directly with normal user Outlook workflow -- it's just another set of
buttons on the Outlook toolbar, and, for the most part, works "all by magic"
after initial setup and training).

If you want to integrate with non-Outlook clients on Windows, they each have
their own ways of dealing with email, and there are several different
formats (some open, some proprietary) in use for storing email.

If you just want to deal with proxies, then you've got POP3 and a large
variety of partially-broken IMAP servers to wrestle with.  If you want the
same functionality in your classifier, you're going to have to write code to
decode email in its full current generality, including character set
conversions, recursive MIME structure, and translating assorted encodings
(spambayes routinely decodes quoted-printable and base64, but the latter
only when the MIME type of a section is text/*; we also decode numeric
character entities in HTML).  Python comes with libraries for doing all of
that stuff, written by experts in the various fields, and debugged by
thousands of users in real life over a span of years.  Decoding email is
harder than you'd first believe, because while standards exist covering these
areas, writing software that adheres to those standards is useless in
practice:  too many programs that generate and send email violate too many
of the specified rules, so software trying to make sense of email has to be
extremely forgiving.  You can't know how in advance, though -- you find out
what you need to forgive by seeing real-life email break your code,
iterating until new email stops breaking it.  That's been going on for years
in the Python libraries, and is still in progress (it's a kind of thing that
can never end, as new bugs in email producers and servers keep appearing,
and some of those become popular for non-technical reasons).
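
To give a flavor of the decoding steps described above, here is a minimal
sketch using the standard email package in a current Python 3.  It is not
the spambayes tokenizer: the name decoded_text_parts, the latin-1 fallback,
and the "replace" error policy are assumptions made for illustration, and it
decodes both transfer encodings for every text/* part.

```python
import email
import re

# Matches decimal numeric character entities such as &#65;
_NUMERIC_ENTITY = re.compile(r"&#(\d{1,7});")


def _replace_entity(match):
    """Turn one numeric entity into its character, or leave it alone."""
    try:
        return chr(int(match.group(1)))
    except ValueError:  # number is outside the Unicode range
        return match.group(0)


def decoded_text_parts(raw_bytes):
    """Yield the decoded text of every text/* part (hypothetical helper)."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():  # walk() recurses through nested MIME parts
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable or base64 transfer encoding
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumed policy: fall back to latin-1 when the declared charset
        # is missing or unknown, and never raise on undecodable bytes.
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            text = payload.decode("latin-1", errors="replace")
        if part.get_content_subtype() == "html":
            text = _NUMERIC_ENTITY.sub(_replace_entity, text)
        yield text
```

Even this short sketch needs two "forgive it and carry on" branches, and
real-life mail demands many more.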

It's possible to write a different kind of tokenizer, one that takes the
incoming email bytes at face value, and doesn't try to do any semantic
analysis.  That kind of tokenizer won't break no matter how wildly a piece
of email may violate "the rules" (because you're ignoring all the rules then
too).  Implementation shortcuts like that could save you tens of thousands
of lines of code and months of effort -- but then it gets increasingly
distant from what the current spambayes code does.
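
A face-value tokenizer of that kind could be as small as the sketch below.
The name raw_tokens and the choice to lowercase and split on ASCII
whitespace are assumptions for illustration; nothing here is taken from the
spambayes code.

```python
def raw_tokens(raw_bytes):
    """Yield whitespace-separated byte tokens, ignoring all email structure.

    Headers, MIME boundaries, and still-encoded bodies are treated as
    ordinary bytes, so no malformed message can make this fail.
    """
    # bytes.lower() only touches ASCII letters; split() with no argument
    # splits on runs of ASCII whitespace.
    for token in raw_bytes.lower().split():
        yield token
```

For example, list(raw_tokens(b"Subject: BUY now")) gives
[b"subject:", b"buy", b"now"].  Note that a base64-encoded body would show
up as opaque tokens rather than words, which is part of what "distant from
the current code" means in practice.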