[Spambayes] Dumping the db

Fri Feb 14 10:06:48 EST 2003

> Is there a convenient way to dump the db used by spammie?

Nope, but Outlook2000/tester.py effectively does this so it can pull out the
most and least spammy words, and create a mail with nothing but them.
(I couldn't think of a better way to auto-generate a message that is
guaranteed to be spam regardless of your training data. :)

It is often more interesting to show the "spam clues" - this is a dump of
the database, but only of the words that contributed to the score for the
particular mail.

You may be disappointed at what a database dump looks like - a list of words
with 2 counts.
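
To give a feel for how plain such a dump is, here is a minimal sketch.  It
assumes the two counts are "times seen in spam" and "times seen in ham", and
it uses an ordinary dict as a stand-in for the database; `dump_db` and
`toy_db` are hypothetical names, not part of spambayes.

```python
# Hypothetical stand-in for the token database: word -> (spam_count, ham_count).
# This is NOT the real spambayes storage format, only an illustration.
def dump_db(db):
    """Return one text line per word: the word followed by its two counts."""
    lines = []
    for word in sorted(db):
        spam_count, ham_count = db[word]
        lines.append(f"{word} {spam_count} {ham_count}")
    return lines


# Made-up counts, purely for demonstration.
toy_db = {"viagra": (42, 0), "python": (1, 57), "free": (30, 12)}

for line in dump_db(toy_db):
    print(line)
```

Nothing in the output says how a word affects any particular message, which
is why the "spam clues" view tends to be more useful.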

> I'm also curious as to the mechanics of "recover from spam" -
> how effective is it at unlearning?

It should be "perfect" - however, I expect that "perfect" means something
different to the engine than it does to you.

After doing a "recover from spam", the database should be left in the same
state as if that particular message had originally been trained as "ham".
Specifically, all that code does is to train on that single message as ham,
then move the message back to where we first got it from.  (There is some
extra logic in place that ensures that if the message had previously been
trained as spam, that will be undone - but this is a rare case.  In most
cases, a message has just been *filtered* when you recover it, not actually
*trained*.)
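
The logic described above can be sketched roughly as follows.  This is a toy
model under stated assumptions, not the actual Outlook addin code:
`ToyTrainer`, its attributes and its methods are all hypothetical, the
"database" is just two word counters, and moving the message back to its
original folder is left out entirely.

```python
from collections import Counter


class ToyTrainer:
    """Toy model of the training data: per-word spam and ham counts."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.trained_as = {}  # message id -> "spam" or "ham"

    def _counts(self, label):
        return self.spam_counts if label == "spam" else self.ham_counts

    def train(self, msg_id, words, label):
        previous = self.trained_as.get(msg_id)
        if previous == label:
            return  # already trained this way, nothing to do
        if previous is not None:
            # The rare case: the message was trained the other way before,
            # so remove its earlier contribution first.
            counts = self._counts(previous)
            for word in set(words):
                counts[word] -= 1
                if counts[word] <= 0:
                    del counts[word]
        # The common case: simply learn the message under the new label.
        for word in set(words):
            self._counts(label)[word] += 1
        self.trained_as[msg_id] = label

    def recover_from_spam(self, msg_id, words):
        # "Recover from spam" is just "train as ham" in this sketch.
        self.train(msg_id, words, "ham")


# A message wrongly trained as spam and then recovered...
recovered = ToyTrainer()
recovered.train("m1", ["cheap", "offer"], "spam")
recovered.recover_from_spam("m1", ["cheap", "offer"])

# ...ends up with the same counts as one trained as ham from the start.
fresh = ToyTrainer()
fresh.train("m1", ["cheap", "offer"], "ham")
print(recovered.spam_counts == fresh.spam_counts)
print(recovered.ham_counts == fresh.ham_counts)
```

For a message that was only filtered and never trained, `trained_as` has no
entry for it, so the undo branch is skipped and only the ham counts change.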

Note that all this does is update the training data.  You may have noticed
that some messages in your "Inbox" still have a high(ish) spam rating, even
though they have been trained as good.  It can be disconcerting to train the
system that a mail is good, just to see that the system still treats the
message as suspect.  As you get more similar messages and continue to train
(by recovering), it gets better.

I have found it useful to create a folder of "Good, spam-looking mail".
This is mail that I have seen spambayes occasionally mis-classify, and that
I would normally have just deleted.  I keep them around so that when I need
to start from scratch, I have some examples of good stuff to help it out,
and avoid that initial stage.  I've actually been thinking that it may be
useful for spambayes to automatically keep copies of such messages, should a
full retrain ever be necessary "in the wild".

> Does the unlearning make a difference based on whether it's
> in "spam" or "spam maybe"?

Nope.  As I mentioned above, in the vast majority of cases, no "unlearning"
is done - only learning that it is good (or bad, depending on the case).
Unlearning is only necessary if the message has previously been trained
incorrectly, and this is rare.

Mark.