[Spambayes] Client/server model

Guido van Rossum guido@python.org
Thu Oct 17 20:05:25 2002

> > What would make more sense from the POV of minimizing traffic and
> > minimizing work done in the server:
> > 
> >   cli parses the message
> >   cli sends the list of tokens to svr
>
> I'd want the server to do tokenization for consistency reasons.
> Particularly if you are also spam filtering news articles and not
> just e-mail messages.

I don't understand this.

> Also, the server can have all that mail parsing code (discarding
> attachments, decoding BASE64 etc), making the client simpler.

But discarding attachments in the client would reduce the traffic to
the server tremendously!  Maybe your server has more available CPU
power than your client though?
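
To illustrate the traffic argument, here is a minimal sketch that compares the size of a full message with the size of its text parts alone. The message, the attachment size, and the helper name text_only_payload are made up for the example; this is not code from either implementation discussed in the thread.

```python
from email.message import EmailMessage


def text_only_payload(msg):
    """Concatenate the text/plain parts, skipping anything marked as an attachment."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and not part.is_attachment():
            chunks.append(part.get_content())
    return "\n".join(chunks).encode("utf-8")


# Made-up example message: a short body plus a 50 kB binary attachment.
msg = EmailMessage()
msg["Subject"] = "Report"
msg.set_content("See the attached file.")
msg.add_attachment(bytes(50000), maintype="application",
                   subtype="octet-stream", filename="report.bin")

full_size = len(msg.as_bytes())           # what a "send everything" client transmits
text_size = len(text_only_payload(msg))   # what is left after dropping the attachment
print(full_size, text_size)
```

The attachment is base64-encoded on the wire, so the full message is larger than the raw 50000 bytes, while the text-only payload is a few dozen bytes.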

> >   svr scores the list of tokens
> >   svr returns the text to be inserted in the X-Hammie-Disposition header
>
> I'm returning the spam ratio in my server (using BeOS inter-program
> communication, though I suppose I could use the package which extends
> the BMessage system to the Internet, but the spam database is really
> a per-user thing so that isn't useful).  I let the client decide if
> it's over their own threshold limit or not (ok, that may be a bad design
> choice).  I'm also returning the list of words and their individual
> scores, but that's mostly for debugging (and wastes a lot of space -
> 150 words at a time!).  The client (a plug-in filter for the BeMail
> package) also does the sound effects (saying "Spam" or "Genuine" as
> each message comes in).

Cool. :-)

> >   cli inserts the X-Hammie-Disposition in the message
> >   cli prints the message to stdout
> > 
> > (I like to minimize traffic as well as the work done by the server;
> > minimizing traffic is always a good idea, while minimizing server work
> > means less load on a shared server -- if the clients run on separate
> > machines, the combined CPU power of the clients is much more than that
> > of the server.)
>
> Actually, it turns out that my server approach really isn't needed
> for speed reasons.  It just takes a fraction of a second to load and
> parse the spam database (a 0.5MB text file of words and numbers,
> stripped of unique strings after initial training on 1500 messages /
> 21000 words).  But still it's nice to have it separate from
> other programs so that it is more modular.

Fractions of seconds add up. :-)
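
The cli/svr split quoted above can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the token table, the cutoff, the "Yes; 0.95"-style header text, and above all the scoring, which is a plain average of per-token probabilities and not the Spambayes combining scheme. The server is stood in for by an ordinary function call; a real setup would put a socket or RPC call in its place.

```python
import re
from email import message_from_string

# Hypothetical per-token spam probabilities; a real server would load
# these from its trained database.
TOKEN_PROBS = {"viagra": 0.99, "free": 0.90, "meeting": 0.05, "python": 0.01}
UNKNOWN_PROB = 0.5   # assumed score for tokens the server has never seen
SPAM_CUTOFF = 0.9    # assumed threshold


def client_tokenize(msg):
    """cli: reduce the message to a list of lowercase word tokens."""
    texts = [msg.get("Subject", "")]
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # Payload is used as-is; no transfer-encoding is decoded here.
            texts.append(part.get_payload())
    return re.findall(r"[a-z0-9$']+", " ".join(texts).lower())


def server_score(tokens):
    """svr: score the token list and return the text for the header."""
    if not tokens:
        return "No; 0.00"
    probs = [TOKEN_PROBS.get(t, UNKNOWN_PROB) for t in tokens]
    score = sum(probs) / len(probs)   # plain average, NOT the Spambayes combiner
    verdict = "Yes" if score >= SPAM_CUTOFF else "No"
    return "%s; %.2f" % (verdict, score)


def client_filter(raw_text, score=server_score):
    """cli: parse, send tokens to svr, insert the header, return the message."""
    msg = message_from_string(raw_text)
    tokens = client_tokenize(msg)
    del msg["X-Hammie-Disposition"]   # drop any header already present
    msg["X-Hammie-Disposition"] = score(tokens)
    return msg.as_string()
```

Used as a filter, the last step of the list would be something like sys.stdout.write(client_filter(sys.stdin.read())). Only the token list travels to the server and only one short string comes back.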

--Guido van Rossum (home page: http://www.python.org/~guido/)