[Spambayes] Documentation...

Wed Nov 20 21:30:38 2002

This may be premature, but as part of helping John Draper set up the
spambayes software I've made a start on some user documentation.  It could
go on the website, or maybe in with the source code - I'm not sure we're
ready to give the impression that this stuff is ready for "normal people"
to use yet.

This stuff refers to the current, unpackaged sources - if we ever package
it up, the documentation will be very different.  But I'm guessing that's a
long way off, and in the meantime we'll all be asked by friends and project
newcomers to explain how it all fits together and how to get it up and
running - this is an attempt to let us say "Here, read this!" when that
happens.

It tries to target both technical and non-technical users (though for some
fairly high values of "non-technical") and may well fall between two stools
as a result.  I'll check it in either with the sources or the website
depending on feedback.  If anyone spots glaring omissions, factual
inaccuracies or downright rudeness, either let me know or edit it after I
check it in - I'm not claiming any editorial rights!

It's also somewhat biased towards the POP3 proxy and the web interface (for
obvious reasons 8-) and lacks any detail on the Outlook plugin because I'm
not one of those lucky Outlook users... this is a not-at-all-veiled plea
for contributors or users who know about the lacking areas to step forward
and write some words!

--------------------------------------------------------------------------

> First some concepts:

 o 'Ham' is the opposite of 'Spam'. 8-)

 o At no point does any part of Spambayes delete emails.  All it does is
   classify them by adding a header that tells you whether they look like
   spam or not.  It's then up to you to use your email software to do
   something in response to that header (the Outlook plug-in does some of
   the work for you).

 o The header that the software adds is called X-Hammie-Disposition (mostly
   for historical reasons, and you can customise it) and has a value of
   Yes, No or Unsure.
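
To make the "it's up to you" part concrete, here is a minimal sketch of the
kind of rule an email client applies to that header.  It uses only Python's
standard "email" package; the folder names are made up for illustration, and
it assumes the default header name has not been customised.

```python
from email import message_from_string

HEADER = "X-Hammie-Disposition"  # the default header name described above


def folder_for(raw_message, header=HEADER):
    """Pick a destination folder from the classification header.

    The folder names are hypothetical; use whatever your client calls them.
    """
    msg = message_from_string(raw_message)
    disposition = (msg.get(header) or "").strip().lower()
    # startswith() rather than == in case extra detail follows the value
    # (that possibility is an assumption, not something documented here).
    if disposition.startswith("yes"):
        return "Suspected spam"
    if disposition.startswith("unsure"):
        return "Unsure"
    # "No", a missing header, or anything unexpected stays in the inbox.
    return "Inbox"


sample = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    "X-Hammie-Disposition: Yes\n"
    "\n"
    "Buy now!\n"
)
print(folder_for(sample))
```

Note that nothing gets deleted here either: the worst a wrong classification
can do is put a message in the wrong folder.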

> There are six main components to the Spambayes system:

 o A database.  Loosely speaking, this is a collection of words and
   associated spam and ham probabilities.  The database says "If a message
   contains the word 'Viagra' then there's a 98% chance that it's spam, and
   a 2% chance that it's ham."  This database is created by training - you
   give it messages, tell it whether those messages are ham or spam, and it
   adjusts its probabilities accordingly.  How to train it is covered
   below.  By default it lives in a file called "hammie.db".

 o The tokeniser/classifier.  This is the core engine of the system.  The
   tokeniser splits emails into tokens (words, roughly speaking), and the
   classifier looks at those tokens to determine whether the message looks
   like spam or not.  You don't use the tokeniser/classifier directly -
   it powers the other parts of the system.

 o The POP3 proxy.  This sits between your email client (Eudora, Outlook
   Express, etc) and your email server, and adds the classification header
   to emails as you download them.  A typical user's email setup looks
   like this:

       +-----------------+                              +-------------+
       | Outlook Express |      Internet or intranet    |             |
       |  (or similar)   | <--------------------------> | POP3 server |
       |                 |                              |             |
       +-----------------+                              +-------------+

   The POP3 server runs either at your ISP for internet mail, or somewhere
   on your internal network for corporate mail.  The POP3 proxy sits in the
   middle and adds the classification header as you retrieve your email:

       +-----------------+        +------------+        +-------------+
       | Outlook Express |        | Spambayes  |        |             |
       |  (or similar)   | <----> | POP3 proxy | <----> | POP3 server |
       |                 |        |            |        |             |
       +-----------------+        +------------+        +-------------+

   So where you currently have your email client configured to talk to,
   say, "pop3.my-isp.com", you instead configure the *proxy* to talk to
   "pop3.my-isp.com" and configure your email client to talk to the proxy.
   The POP3 proxy can live on your PC, or on the same machine as the POP3
   server, or on a different machine entirely - it really doesn't matter.
   If it's living on your PC, you'd configure your email client to talk
   to "localhost".

 o The web interface.  This is a server that runs alongside the POP3 proxy
   and lets you control it through the web.  You can upload emails to it
   for training or classification, query the probabilities database ("How
   many of my emails really *do* contain the word Viagra?") and most
   importantly, train it on the emails you've received.  When you start
   using the system, unless you train it using the Hammie script it will
   classify most things as Unsure, and often make mistakes.  But it keeps
   copies of all the emails it's seen, and through the web interface you
   can train it by going through a list of all the emails you've received
   and checking a Ham/Spam box next to each one.  After training on a few
   messages (say 20 spams and 20 hams), you'll find that it's getting it
   right most of the time.   The web training interface automatically
   checks the Ham/Spam boxes according to what it thinks, so all you need
   to do is correct the odd mistake - it's very quick and easy.

 o The Outlook plug-in.  For Outlook 2000 users (not Outlook Express) this
   lets you manage the whole thing from within Outlook.  You set up a Ham
   folder and a Spam folder, and train it simply by dragging messages into
   those folders.  Alternatively there are buttons to do the same thing. 
   And it integrates into Outlook's filtering system to make it easy to
   file all the suspected spam into its own folder, for instance.

 o The Hammie script.  This does three jobs: command-line training,
   procmail filtering, and XML-RPC.  To train on a whole collection of
   messages, stored either as mbox files or as collections of message files
   in a directory, you run "hammie.py -g ham -s spam", where 'ham' is the
   mbox file or directory containing ham, and 'spam' is the mbox file or
   directory containing spam.  Procmail is a unix-based email filtering
   system - to use Hammie as a procmail filter, run it as
   "hammie.py -f" from a procmail rule.  It will read a message from its
   input, add the header, and write it to its output.  Hammie can also
   run as an XML-RPC server, so that a programmer can write code that uses
   a remote server to classify emails programmatically - see hammiesrv.py.
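
For anyone who finds the database and tokeniser easier to picture in code,
here is a toy sketch of the idea.  It is emphatically not the Spambayes
algorithm: the real tokeniser and classifier are far more careful, and the
class and function names below are invented for this example.  It uses a
naive per-word frequency ratio just to show how training turns messages into
per-word spam probabilities.

```python
import re
from collections import Counter


def tokenise(text):
    """Split text into a set of lowercase word-like tokens (very roughly)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class ToyDatabase:
    """Hypothetical stand-in for the word/probability database."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spams containing it
        self.ham_counts = Counter()   # token -> number of hams containing it
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        """Record one message that the user has labelled as spam or ham."""
        tokens = tokenise(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def spam_probability(self, word):
        """Naive estimate: the word's spam rate relative to its ham rate.

        Returns 0.5 (no opinion) for words never seen in training.
        """
        spam_rate = self.spam_counts[word] / self.nspam if self.nspam else 0.0
        ham_rate = self.ham_counts[word] / self.nham if self.nham else 0.0
        if spam_rate + ham_rate == 0:
            return 0.5
        return spam_rate / (spam_rate + ham_rate)


db = ToyDatabase()
db.train("Cheap viagra, buy now", is_spam=True)
db.train("Viagra at low prices", is_spam=True)
db.train("Lunch on Friday?", is_spam=False)
db.train("Meeting notes attached", is_spam=False)
for word in ("viagra", "lunch", "unknown"):
    print(word, db.spam_probability(word))
```

With only four training messages the probabilities are extreme, which is one
reason a real classifier needs both more training and smarter arithmetic.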

> Where things live:

The Hammie script is called hammie.py.  The POP3 proxy and the web
interface live in pop3proxy.py.  The Outlook plug-in lives in the
Outlook2000 subdirectory - see the README.txt in that directory for more
information on that.

As well as these components, there's also a whole pile of utility scripts,
test harnesses and so on - see README.txt and TESTING.txt in the spambayes
distribution for more information.

> Configuration:

The system is configured through a file called "bayescustomize.ini".  In
here you can configure the name and type of your database, the POP3
server(s) you want to proxy to, the ports you want the proxy and the web
interface to run on, and so on.  You can also control details like how sure
you want the system to be that a message really is spam before it marks it
as such.  The default values for all the options, and the documentation for
them, all live in Options.py.  To change an option, create a
bayescustomize.ini and add the option to that - don't edit Options.py.
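
The "defaults in one place, overrides in the ini file" arrangement can be
sketched with Python's standard configparser module.  This is only an
illustration of the layering, not how Options.py is actually implemented.
The section and option names are the ones used elsewhere in this document;
the DEFAULTS table is a stand-in invented for the example.

```python
import configparser

# Stand-in defaults for this example only; the real ones live in Options.py.
DEFAULTS = {"pop3proxy": {"pop3proxy_ports": "110"}}


def load_options(path="bayescustomize.ini"):
    """Return the defaults overlaid with whatever the ini file sets."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read(path)  # a missing file is silently ignored
    return parser


options = load_options()
print("proxy port:", options.getint("pop3proxy", "pop3proxy_ports"))
print("servers:", options.get("pop3proxy", "pop3proxy_servers",
                              fallback="(not set)"))
```

The point of the layering is that an upgrade can change the defaults without
trampling on your own settings.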

> Requirements:

To run the software, you need Python 2.2 or above.  You also need version
2.4.3 or above of the Python "email" package.  If you're running the CVS
version of Python (known as 2.3a0) then you already have this.  If not, you
can download it from http://mimelib.sf.net and install it - unpack the
archive, cd to the email-2.4.3 directory and type "python setup.py
install".  This will install it into your Python site-packages directory.
You'll also need to move aside the standard "email" library - go to your
Python "Lib" directory and rename "email" to "email_old".
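
If you're not sure what you have installed, a quick check along these lines
may help.  It's a rough sketch: it assumes the "email" package exposes a
__version__ attribute, which not every release does, so it falls back to
telling you to check by hand.  The version_tuple helper is invented for this
example.

```python
import email
import re
import sys


def version_tuple(text):
    """Turn '2.4.3' into (2, 4, 3) so that versions compare numerically."""
    numbers = []
    for part in text.split("."):
        match = re.match(r"\d+", part)
        if not match:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


print("Python 2.2 or above:", sys.version_info[:2] >= (2, 2))

# Assumption: the package carries a __version__ string; some releases don't.
email_version = getattr(email, "__version__", None)
if email_version is None:
    print("email package: no __version__ attribute, check it by hand")
else:
    print("email 2.4.3 or above:", version_tuple(email_version) >= (2, 4, 3))
```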

> Setup on unix (Windows/Mac users can ignore this bit):

On a unix machine, unless you're running as root (which we strongly advise
you don't!) you can't run the proxy on port 110.  Besides, you quite
possibly already have a POP3 server running on that port.  You need to run
it on an unprivileged port, say 1110.  You do this by adding the line

pop3proxy_ports: 1110

to bayescustomize.ini - all will become clear in the next section.  Where
we talk about port 110, you use port 1110.
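
If you want to see for yourself which ports your account is allowed to use,
something like the following will tell you.  It's a small sketch using the
standard socket module, and the can_bind helper is made up for this example.
It only tries to bind the port and lets go of it straight away.

```python
import socket


def can_bind(port, host=""):
    """Return True if this process is able to bind the given TCP port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Covers both "permission denied" (privileged port, not root)
        # and "address already in use" (something else is listening).
        return False
    finally:
        sock.close()


for port in (110, 1110):
    print(port, "ok" if can_bind(port) else "not available")
```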

> Minimal setup for using the POP3 proxy and web interface:

The minimum you need to do to get started is create a bayescustomize.ini
containing the following:

[pop3proxy]
pop3proxy_servers: pop3.my-isp.com

where "pop3.my-isp.com" is wherever you currently have your email client
configured to collect mail from.

You can now run the proxy by running "python pop3proxy.py".  This will
print some status messages, which should include:

BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.

What that means is that the POP3 proxy is ready for your email client to
connect to it (110 is the standard port number for POP3 - you can use a
different one by adding a line to bayescustomize.ini - see Options.py) and
that the web interface is ready for your browser to connect to it.  The
address of the web interface is http://localhost:8880/ (or if you're
running it on a different machine, replace 'localhost' with the name of the
machine).  You can have a look at the web interface now, but it won't be
very interesting because the system is untrained and has seen no messages
yet.
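
If the status messages have scrolled away, you can confirm that both
listeners are up with a quick connection test.  This is a sketch using the
standard socket module, with the default ports quoted above.  The
is_listening helper is invented for the example; change "localhost" if the
proxy runs on another machine.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved.
        return False


for name, port in (("POP3 proxy", 110), ("web interface", 8880)):
    state = "listening" if is_listening("localhost", port) else "not reachable"
    print("%s on port %d: %s" % (name, port, state))
```

It only proves that *something* is listening on those ports, of course, not
that it's Spambayes.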

> Reading emails and training the classifier:

You now need to configure your email client to talk to the proxy instead of
the real email server.  Change your equivalent of "pop3.my-isp.com" to
"localhost" (or to the name of the machine you're running the proxy on) in
your email client's setup.  Hit "Get new email" and look at the headers of
the emails (send yourself an email if you don't have any!) - there should
be an X-Hammie-Disposition header there.  It probably says "Unsure",
because you haven't done any training yet.  You should be able to create a
mail folder called "Suspected spam" and set up a filtering rule that puts
emails with an "X-Hammie-Disposition: Yes" header into that folder.
(Eventually we should publish instructions on how to do this in all the
popular email clients).
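
If your email client makes it awkward to view headers, you can check that
the header is being added with a few lines of Python instead.  The sketch
below uses the standard poplib module to fetch each message through the
proxy and print its disposition.  The function names are invented for the
example, and it assumes the proxy is on localhost port 110 and passes your
usual username and password through to the real server.  It retrieves
messages but never deletes them.

```python
import poplib
from email import message_from_bytes


def disposition_from_lines(lines):
    """Extract X-Hammie-Disposition from the raw lines poplib returns."""
    msg = message_from_bytes(b"\r\n".join(lines) + b"\r\n\r\n")
    return msg.get("X-Hammie-Disposition")


def show_dispositions(user, password, host="localhost", port=110):
    """Print the classification of every message waiting on the server."""
    conn = poplib.POP3(host, port)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()
        for number in range(1, count + 1):
            # retr() downloads the message; nothing is deleted without dele().
            _resp, lines, _octets = conn.retr(number)
            print(number, disposition_from_lines(lines))
    finally:
        conn.quit()


# Example call (fill in your own details):
# show_dispositions("my-username", "my-password")
```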

You can now train the system through the web interface - follow the "Review
messages" link and you'll see a list of the emails that the system has seen
so far.  Check the appropriate boxes and hit Train.  The messages disappear
(eventually you'll be able to get back to them, for instance to correct any
training mistakes) and if you go back to the home page you'll see that the
"Total emails trained" has increased.

Once you've done this on a few spams and a few hams, you'll find that the
X-Hammie-Disposition header is getting it right most of the time.  The more
you train it the more accurate it gets.  There's no need to train it on
every message you receive, but you should train on a few spams and a few
hams on a regular basis.  You should also try to train it on about the same
number of spams as hams.

You can train it on lots of messages in one go using the Hammie script, as
explained above.
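
Before bulk training it can be worth checking that your ham and spam
collections are roughly the same size, per the advice above.  Here's a
sketch using the standard mailbox module.  It handles mbox files only, not
directories of message files.  The function names are made up, and the "more
than twice as many" threshold is an arbitrary choice for the example.

```python
import mailbox


def count_messages(path):
    """Count the messages in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def balance_report(ham_path, spam_path):
    """Print the ham and spam counts and warn if they are lopsided."""
    nham = count_messages(ham_path)
    nspam = count_messages(spam_path)
    print("ham: %d, spam: %d" % (nham, nspam))
    # Arbitrary threshold for this sketch: one pile over twice the other.
    if max(nham, nspam) > 2 * max(1, min(nham, nspam)):
        print("Warning: the two collections are quite unbalanced.")
    return nham, nspam


# Example call, using the same names as the hammie.py command line above:
# balance_report("ham", "spam")
```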

--------------------------------------------------------------------------

-- 
Richie Hindle
richie@entrian.com