[spambayes-dev] I took a big step Tuesday...

Skip Montanaro skip at pobox.com
Thu Jul 24 12:53:02 EDT 2003


    Alex> I think that there is some use to finer-grain categories, but I'm
    Alex> not convinced of both (a) the score should be used for such
    Alex> categorization, and (b) it needs to be done in spambayes.

Unless you assume some sort of voting arrangement like you use (SA & SB), I
don't see what else you could base any judgement on other than the SB score.
The reason I was ruminating about something within Spambayes is that it's
not clear that most people will have a non-Spambayes way of dealing with
such issues.  For example, the order in which various filters are applied
within Outlook seems somewhat non-deterministic, so a message which SB
scores at 1.00 will not necessarily be caught by a later user-defined
Outlook filter designed to delete such messages.

Maybe I'm completely mistaken and most mail user agents now have
sufficiently sophisticated rules or filters to allow headers which match
"spam; 1.00" to be deleted automatically.  However, I mentioned the extra
criterion of a "large enough" training database.  If someone retrains from
scratch, those first few messages are likely to be scored rather poorly, and
false positives will be much more likely.  In my case, I keep my training
set around.  When I retrain from scratch, it's on 21,000 messages, so that
initial state where scores can be wildly wrong doesn't happen.  I suspect
most day-to-day users won't keep a handy training set around.
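
For mail setups that can run a script, the header match mentioned above
might look roughly like the sketch below.  The header name
X-Spambayes-Classification is an assumption about how Spambayes is
configured, and is_certain_spam is just an illustrative helper, not
anything in the Spambayes code.

```python
from email import message_from_string

# Assumed header name; check what your Spambayes setup actually writes.
HEADER = "X-Spambayes-Classification"

def is_certain_spam(raw_message, header=HEADER):
    """Return True only if the header value reads 'spam; 1.00'."""
    msg = message_from_string(raw_message)
    value = msg.get(header, "")
    # Split "spam; 1.00" into the label and the formatted score.
    label, _, score = value.partition(";")
    return label.strip().lower() == "spam" and score.strip() == "1.00"
```

Matching on the formatted string rather than parsing a float keeps it
equivalent to what a simple MUA rule would do.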

    Alex> I've had a couple false positives recently, both receipts from
    Alex> online purchases.  They both scored very high by spambayes and
    Alex> spamassassin.

That's why I still scan everything which scores between 0.80 and 0.99.

    >> Relating that to spambayes-dev subject matter, perhaps a "super-spam"
    >> cutoff could be created which would automatically delete messages
    >> which score at or above that value if the user's training set was
    >> "large enough".  Thus, if they started training from scratch it would
    >> have no effect.  By default, it would be set to something > 1.0 to
    >> prevent it from coming into play unexpectedly.  I don't know what
    >> "large enough" is though.
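
A rough sketch of that cutoff idea, purely for illustration.  Every name
and number here is hypothetical: disposition, min_training and the cutoff
values are placeholders, not existing Spambayes options, and 1000 is only
a stand-in for the unknown "large enough".

```python
def disposition(score, training_count,
                ham_cutoff=0.20,
                spam_cutoff=0.90,
                super_spam_cutoff=1.01,  # > 1.0, so auto-delete is off by default
                min_training=1000):      # placeholder for "large enough"
    """Classify a score, deleting only when the training set is big enough."""
    # The super-spam rule has no effect until enough messages are trained.
    if training_count >= min_training and score >= super_spam_cutoff:
        return "delete"
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

With the defaults nothing is ever deleted; a user would have to lower
super_spam_cutoff to 1.0 (or below) and also have trained on at least
min_training messages before the "delete" branch could fire.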

    Alex> I don't think that such extra gradation needs to be in the
    Alex> spambayes code; the obvious super-spam at 1.00 is easily matched
    Alex> by MDAs already, and more generic benefit is derived from using a
    Alex> completely different method such as spamassassin to further
    Alex> categorize.

Maybe it's just something we should document in the FAQ.  There is question
3.10, but it's specific to Outlook.

Skip


