[Spambayes] Header annotations included in classification?

Tony Meyer tameyer at ihug.co.nz
Tue Nov 25 17:40:10 EST 2003


[Remi]
> If you retrain (an email was not classified correctly, or 
> retrain from scratch with old mail ) this can be a problem
> but there is options to not look at headers  but I think
> the subject is always classified.

[Dave Brueck]
> Aha! Thanks for your help - I feel a bit dumb but now 
> understand my problem better. :) On the Home -> Review web 
> page I was looking through the list of emails for which it 
> was unsure and marking them as Ham or Spam and then clicking 
> the Train button. But as Spambayes began to classify emails 
> on its own they began to also show up in their own section 
> ("Messages classified as Spam") and by default they have the 
> "Spam" radio button selected so by clicking Train I was 
> running them through as training material as well.

A little late, but some clarification:

When SpamBayes trains via (the review page of) the web interface, it's not
accessing the mail from your mail client (whatever it is) - it's using a
copy of the message that it stored in its cache when the message arrived.
This copy does not have any SpamBayes information in it - it's a raw copy of
the incoming message.  So there's no way for training via (the review page
of) the web interface to include SpamBayes data, unless it was in the
original message.

The review page could strip out the [Spam] bit from the headers, if SpamBayes
is adding it; I note that there's a feature request open about this, so I'll
(or someone else'll) get to this at some point.

If you *retrain* (correct previous training, for example) on mail via the
web interface (using the "find message" query), then this also works with
the cached messages, so will also work correctly.  If, however, you train
from scratch by providing mbox/dbx files via the front page of the web
interface, and the mail in those files has SpamBayes data, it *will* be
used.  For the most part, this won't matter - all the SpamBayes headers are
ignored by default, so they won't have any effect.  The modified subject/to
header *will* be tokenized, though.  (This isn't something I had considered
previously).
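
To make the subject problem concrete, here is a toy sketch.  It is not the
SpamBayes tokenizer: toy_subject_tokens is a made-up stand-in that just
splits on whitespace, and the "[Spam] " prefix is an assumed tag format.
The point is only that whatever gets added to the subject shows up as one
more token at training time.

```python
def toy_subject_tokens(subject):
    """Hypothetical stand-in for a tokenizer: lowercase, split on whitespace."""
    return ["subject:" + word for word in subject.lower().split()]


original = "Cheap watches today"
# Assumed shape of an annotated copy, as it might sit in an mbox file.
annotated = "[Spam] " + original

print(toy_subject_tokens(original))
print(toy_subject_tokens(annotated))
```

The second list carries an extra "subject:[spam]" token that was never in
the message as it arrived.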

I think that the correct behaviour would be to strip '[ham|unsure|spam]'
from subjects when tokenizing if SpamBayes is set to add those
classifications.  This wouldn't be perfect - and maybe would lose
information that something else is adding - but it's probably more reliable
than the current situation (I think it's probably better to not use all the
data than to use incorrect data).  I (or whoever) will look into this
when dealing with the feature request.
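
Something along these lines, as a rough sketch only.  The function name, the
"annotating" flag and the assumption that the tag sits at the very start of
the subject are all mine for illustration; none of this is existing
SpamBayes code or an existing option.

```python
import re

# Assumed tag format: "[ham]", "[unsure]" or "[spam]" at the start of the
# subject, in any letter case, followed by optional whitespace.
_TAG_RE = re.compile(r"^\s*\[(?:ham|unsure|spam)\]\s*", re.IGNORECASE)


def strip_classification_tag(subject, annotating=True):
    """Return the subject without a leading classification tag.

    'annotating' is a hypothetical flag standing in for "SpamBayes is set
    to add classifications to the subject"; when it is False the subject
    is returned untouched.
    """
    if not annotating:
        return subject
    return _TAG_RE.sub("", subject, count=1)


print(strip_classification_tag("[Spam] Cheap watches today"))
print(strip_classification_tag("[Spam] Cheap watches today", annotating=False))
```

Anchoring the pattern at the start of the subject keeps the damage small: a
"[ham]" that appears later in a subject is left alone, although a genuine
leading "[spam]" added by something else would still be lost.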

=Tony Meyer



