[Spambayes] RE: [spambayes-dev] A query on Frequently Asked Question 4.6

Tony Meyer tameyer at ihug.co.nz
Mon Nov 1 05:21:05 CET 2004


> The FAQ item 4.6 above addressed something that I've 
> wondered about since I got SB going around July this year,
> that is, "Where does all the 'saved' spam go?" <smile>. 

The messages that end up in your mailer stay there unless you tell it to do
something with them (with the Outlook plug-in, this means they stay in the
designated spam folder).  The copies of the messages that sb_server makes
are moved to a folder where they are stored for a set amount of time (by
default 7 days) and are then permanently deleted (as if you had chosen
"discard" on the review page).
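
As a rough illustration of that expiry rule (this is not the actual
sb_server code; the function name and arguments are made up for the
example), the check amounts to comparing a stored copy's age against the
retention period:

```python
from datetime import datetime, timedelta

def is_expired(stored_at, now, keep_days=7):
    """Return True if a cached copy is older than the retention period.

    keep_days defaults to 7 to mirror the default mentioned above; the
    function and parameter names are hypothetical.
    """
    return now - stored_at > timedelta(days=keep_days)

now = datetime(2004, 11, 1)
print(is_expired(datetime(2004, 10, 20), now))  # older than 7 days
print(is_expired(datetime(2004, 10, 30), now))  # still within 7 days
```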

The information that SpamBayes collects from messages that it's trained on
is stored in a database (called hammie.db or default_bayes_database.db by
default).  This is just a list of all the tokens ('words') that have been in
all of those messages and a count of how many good & how many bad messages
each token has been seen in.  It's not possible (with an unpatched SpamBayes)
to go backwards from the database to a particular message - to 'untrain',
you have to have a copy of the original message around.
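
To make that concrete, here is a toy sketch of such a token store.  It is
not the real SpamBayes database code: the class name, the whitespace
tokenizer and the [spam, ham] layout are all assumptions made for the
example.  The point is that only per-token counts are kept, so the only way
to undo a training step is to feed the same message back in and subtract.

```python
from collections import defaultdict

class TokenCounts:
    """Toy token store: per-token spam/ham message counts, nothing else."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # token -> [spam, ham]
        self.nspam = 0
        self.nham = 0

    def _tokens(self, text):
        # Hypothetical tokenizer: lowercase whitespace split, one count
        # per message for each distinct token.
        return set(text.lower().split())

    def _update(self, text, is_spam, delta):
        idx = 0 if is_spam else 1
        for tok in self._tokens(text):
            self.counts[tok][idx] += delta
            if self.counts[tok] == [0, 0]:
                del self.counts[tok]  # drop tokens no message contributes to
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta

    def train(self, text, is_spam):
        self._update(text, is_spam, +1)

    def untrain(self, text, is_spam):
        # Needs the original message text: the counts alone do not say
        # which tokens came from which message.
        self._update(text, is_spam, -1)

db = TokenCounts()
db.train("cheap pills now", is_spam=True)
db.train("meeting now", is_spam=False)
db.untrain("cheap pills now", is_spam=True)
```

After the untrain call, the store looks exactly as if only the ham message
had ever been trained.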

> I do have a lot of spam now, 
> 16,054 spams and 1,725 hams. 
> The thing is where do I go to cut down on the amount of spam? 
> I see the comment, "Warning: you have much more spam 
> than ham - SpamBayes 
> works best with approximately even numbers of ham and spam." 
> in the 'Status and 
> Configuration' section.

Firstly, the golden rule is "if it's working, don't change anything" (where
working would be defined as classifying well enough that you are happy with
the results).  All sets of mail are different, but it does appear that
results are better when the balance is reasonably even - and certainly that
large imbalances (e.g. 50:1) can cause weird results.  You're at 9.3:1,
which isn't extremely bad, although it wouldn't be surprising if results
weren't optimal.
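
For reference, that ratio is just the trained spam count divided by the
trained ham count.  A quick sketch (the function name is made up for the
example):

```python
def imbalance(nspam, nham):
    """Return the spam-to-ham ratio of the trained message counts."""
    return nspam / nham

ratio = imbalance(16054, 1725)
print(f"{ratio:.1f}:1")  # prints 9.3:1
```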

That's quite a large database, too, really.  Many of the developers have
databases with fewer than a thousand messages in total, with good results.
However, at the moment, it's only possible to add to the db, not remove from
it.  The general practice is to retrain from scratch when that's desired -
since SpamBayes learns quickly, that's not usually a problem.

I'd suggest reading (if you haven't already) the information about training
on the wiki: <http://entrian.com/sbwiki>.  In particular, doing 'mistake
based training' or 'nonedge training' may help keep the db small, *may* help
the imbalance, and may help the results.
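
As a rough sketch of how those two regimes decide whether a message gets
trained (the wiki is the authority on the details; the function name, the
0.20/0.90 cutoffs and the 0.05 edge margin below are assumptions chosen for
the example, not settings read from anyone's configuration):

```python
def should_train(score, is_spam, regime,
                 ham_cutoff=0.20, spam_cutoff=0.90, edge=0.05):
    """Decide whether to train on a message under a given regime.

    score   -- spam probability the classifier gave the message (0.0-1.0)
    is_spam -- what the message really is, according to the user
    regime  -- "mistake" or "nonedge"
    """
    if is_spam:
        # Spam that did not land in the spam range was wrong or unsure.
        mistake = score < spam_cutoff
        nonedge = score < 1.0 - edge
    else:
        # Ham that did not land in the ham range was wrong or unsure.
        mistake = score > ham_cutoff
        nonedge = score > edge
    if regime == "mistake":
        return mistake
    if regime == "nonedge":
        return mistake or nonedge
    raise ValueError(f"unknown regime: {regime}")

print(should_train(0.93, True, "mistake"))  # correctly classified spam
print(should_train(0.93, True, "nonedge"))  # correct, but not at the edge
```

Either way, messages the classifier already handles confidently are
skipped, which is what keeps the database small.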

We're trying to figure out ways to help with the imbalance issue for future
releases, although there's no clear solution as yet.

(You could put the current databases aside, try out a new training regime
with fresh ones, and see if that works.  If it doesn't, you can just put the
old databases back!)

> 4.3   How do I train SpamBayes (forward/bounce method)?
[...]
> ...I saw from that item 4.3 that there are locations at...
> spambayes_spam at localhost
> ...and at...
> spambayes_ham at localhost
> ...are these locations on this my computer or are they 
> located at my ISP's server? 

These aren't locations, they are special email addresses.  The idea is: 

  All your outgoing mail goes via an SMTP server (at your ISP, probably).
  Rather than have to use the review page, you can bounce/forward messages
  to SpamBayes to indicate that they need to be trained.  The way this
  works is to have all outgoing mail go through a proxy on its way to the
  SMTP server (like incoming mail goes through a proxy on its way from the
  POP server).  This SpamBayes SMTP proxy examines the recipient list of
  each outgoing message, and if the message is addressed to one of those
  two special addresses (which don't exist), it is intercepted and used
  for training instead.  Some people use this method for training, but
  it's not particularly popular, and does have some flaws.  It's better
  to use the review page in most cases.  (There's another, unfinished,
  script that offers all-in-the-mail-client training, which I have more
  hope for).
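
The recipient check described above can be sketched roughly like this.
This is an illustration only, not the actual proxy code: the function name
and return values are made up, and the two addresses are written in their
usual user@host form.

```python
SPAM_ADDRESS = "spambayes_spam@localhost"
HAM_ADDRESS = "spambayes_ham@localhost"

def route_outgoing(recipients):
    """Decide what a training-aware SMTP proxy does with one message.

    Returns "train_spam" or "train_ham" if the message is addressed to
    one of the special addresses, otherwise "relay" (pass it on to the
    real SMTP server untouched).
    """
    normalized = {r.strip().lower() for r in recipients}
    if SPAM_ADDRESS in normalized:
        return "train_spam"
    if HAM_ADDRESS in normalized:
        return "train_ham"
    return "relay"

print(route_outgoing(["spambayes_spam@localhost"]))  # train_spam
print(route_outgoing(["friend@example.com"]))        # relay
```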

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.


