[Fwd: [Spambayes] [spambayes-dev] A query on Frequently Asked Question 4.6]

Remi Ricard papaDoc at videotron.ca
Tue Nov 2 14:34:16 CET 2004

Hi Harry,

This is a cut and paste of the message from Tony, so you can delete the one
with the attached email.

> The FAQ item 4.6 above addressed something that I've 
> wondered about since I got SB going around July this year,
> that is, "Where does all the 'saved' spam go?" <smile>. 

The messages that end up in your mailer stay there, unless you tell it to do
something with them (with the Outlook plug-in, this means they stay in the
designated spam folder).  The copies of the messages that sb_server makes
are moved to a folder where they are stored for a set amount of time (by
default 7 days) and then are permanently deleted (as if you had chosen
"discard" from the review page).

The information that SpamBayes collects from messages that it's trained on
is stored in a database (called hammie.db or default_bayes_database.db by
default).  This is just a list of all the tokens ('words') that have appeared
in those messages, with a count of how many good and how many bad messages
each token has been seen in.  It's not possible (with an unpatched SpamBayes)
to go backwards from the database to a particular message - to 'untrain',
you have to have a copy of the original message around.
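
To make the point about untraining concrete, here is a minimal sketch of a
token-count store.  It is only a toy: the class name TokenCounts, the
whitespace tokenizer and the in-memory dict are assumptions for illustration,
not the real SpamBayes classes or database format.  It shows why the original
text is needed to untrain: the counts alone don't say which tokens a given
message contributed.

```python
from collections import defaultdict


class TokenCounts:
    """Toy per-token ham/spam counter (hypothetical, not SpamBayes' own code)."""

    def __init__(self):
        # token -> [number of ham messages, number of spam messages]
        self.counts = defaultdict(lambda: [0, 0])
        self.nham = 0
        self.nspam = 0

    def train(self, text, is_spam):
        self._update(text, is_spam, +1)

    def untrain(self, text, is_spam):
        # Needs the same text that was trained on, to find the same tokens.
        self._update(text, is_spam, -1)

    def _update(self, text, is_spam, delta):
        index = 1 if is_spam else 0
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        # Each token is counted once per message, however often it occurs.
        for token in set(text.lower().split()):
            self.counts[token][index] += delta
            if self.counts[token] == [0, 0]:
                del self.counts[token]


db = TokenCounts()
db.train("cheap pills now", is_spam=True)
db.train("lunch now", is_spam=False)
db.untrain("cheap pills now", is_spam=True)
```

After the untrain call, only the tokens from the ham message remain in the
store.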

>> I do have a lot of spam now, 
>> 16,054 spams and 1,725 hams. 
>> The thing is where do I go to cut down on the amount of spam? 
>> I see the comment, "Warning: you have much more spam 
>> than ham - SpamBayes 
>> works best with approximately even numbers of ham and spam." 
>> in the 'Status and 
>> Configuration' section.

Firstly, the golden rule is "if it's working, don't change anything" (where
working would be defined as classifying well enough that you are happy with
the results).  All sets of mail are different, but it does appear that
results are better when the balance is reasonably even - and certainly that
large imbalances (e.g. 50::1) can cause weird results.  You're at 9.3::1,
which isn't extremely bad, although it wouldn't be surprising if results
weren't optimal.
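
For reference, the 9.3::1 figure is just the spam count divided by the ham
count quoted above.  A small sketch of the arithmetic (the function name
spam_ham_ratio is made up for this example):

```python
def spam_ham_ratio(nspam, nham):
    """Return the number of spam messages trained per ham message."""
    if nham == 0:
        raise ValueError("no ham trained yet, ratio is undefined")
    return nspam / nham


# Counts taken from the message being replied to.
ratio = spam_ham_ratio(16054, 1725)
print(f"{ratio:.1f}::1")  # prints 9.3::1
```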

That's quite a large database, too, really.  Many of the developers have
databases with fewer than a thousand messages in total, with good results.
However, at the moment, it's only possible to add to the db, not remove from
it.  The general practice is to retrain from scratch when that's desired -
since SpamBayes learns quickly, that's not usually a problem.

I'd suggest reading (if you haven't already) the information about training
on the wiki: <http://entrian.com/sbwiki>.  In particular, doing 'mistake
based training' or 'nonedge training' may help keep the db small, *may* help
the imbalance, and may help the results.
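
As a rough illustration of mistake-based training, the idea is to train only
on messages that did not end up in the right category.  The sketch below is an
assumption-laden toy, not SpamBayes code: the function name is invented, the
cutoff values 0.20 and 0.90 are assumed here for illustration, and treating
"unsure" as a mistake is one possible reading.  See the wiki for the regimes as
actually described.

```python
def should_train_on_mistake(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return True if a message with this spam score was not classified correctly.

    score is the spam probability in [0, 1]; the cutoffs are assumed values.
    """
    if score >= spam_cutoff:
        classified = "spam"
    elif score <= ham_cutoff:
        classified = "ham"
    else:
        classified = "unsure"
    actual = "spam" if is_spam else "ham"
    # Train only when the classifier's answer differs from the truth.
    return classified != actual


# A spam scored 0.95 is already handled correctly, so it is skipped;
# a spam scored 0.50 landed in "unsure", so it would be trained.
skip = should_train_on_mistake(0.95, is_spam=True)
train = should_train_on_mistake(0.50, is_spam=True)
```

Because correctly classified messages are skipped, the database grows only
when the classifier needs correcting.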

We're trying to figure out ways to help with the imbalance issue for future
releases, although there's no clear solution as yet.

(You could put aside the current databases, try out, with fresh ones, a new
training regime and see if that works.  If it doesn't, you can just put the
old database back!).

>> 4.3   How do I train SpamBayes (forward/bounce method)?

>> ...I saw from that item 4.3 that there are locations at...
>> spambayes_spam at localhost
>> ...and at...
>> spambayes_ham at localhost
>> ...are these locations on this my computer or are they 
>> located at my ISP's server? 

These aren't locations, they are special email addresses.  The idea is: 

  All your outgoing mail goes via an SMTP server (at your ISP, probably).
  Rather than having to use the review page, you can bounce/forward messages
  to SpamBayes to indicate that they need to be trained.  The way this
  works is to have all outgoing mail go through a proxy on its way to the
  SMTP server (like incoming mail goes through a proxy on its way from the
  POP server).  This SpamBayes SMTP proxy examines the recipient list of
  each outgoing message, and if the message is addressed to one of those two
  special addresses (which don't exist), it is intercepted and used for
  training instead of being sent on.  Some people use this method for training,
  but it's not particularly popular, and does have some flaws.  It's better
  to use the review page in most cases.  (There's another, unfinished,
  that offers all-in-the-mail-client training, which I have more hope

>In article <4186E714.6010504 at videotron.ca>, Remi Ricard wrote:
>>[Attachment decoded to FILE://\PROGRA~1\VA\DOWNLOAD\_Spambayes_ RE_ _spambayes-dev_ A query on Frequently Asked Question 4.6]
>>[Unable to display message/rfc822 in \PROGRA~1\VA\DOWNLOAD\msw00811.tmp]
>     I got this from you and it landed, as it should, in Virtual Access' (VA's) DOWNLOAD folder (curiously it landed there and 
>decoded itself in the process - I have VA set that any attachment that comes into DOWNLOAD, comes in as an 'Extract' and needs 
>me to okay things before it decodes - stranger and stranger). 
>     In any case it would not open up. I tried to open it up using Word but it told me that it was '...not a valid archive'. 
>     Perhaps there is another way of opening it? 
You can try opening it with Notepad, since an email is simply a text file.
What are you using to read your mail?

>     ++++++++++
>     Whoa! Now I find that every time my cursor crosses that entry in the VA  DOWNLOAD folder I immediately get an XP  'Save 
>As' box and no amount of cancelling will get rid of it. 
>     I shall wait till I hear from you before trying to delete that attachment. 
>     I'll send this off just now I have another email on hold just now about the SpamBayes saga <s>. I hope you have the 
>patience to follow this. 
I'm using Win2K with Thunderbird, so I can't help you on this one, sorry.

