[Spambayes] Advantages of S?

Greg Louis glouis@dynamicro.on.ca
Thu, 19 Sep 2002 10:53:15 -0400


On 20020919 (Thu) at 10:35:04 -0400, Gary Robinson wrote:
> At present, it appears that S may perform better than relatively untweaked
> graham (as I guess bogofilter is using)

Yes, bogofilter 0.7 (haven't looked at the latest release yet) is using
untweaked Graham.

> 1) emails in the middle could be classified as "you should manually check
> these", or
> 
> (1) is attractive but one thing worries me: it may give people a false
> sense of security -- they may think they only have to look at the middle
> ones, but occasionally even extreme cases of seeming to be spam might be
> legit.

Yup.  In the test I reported earlier, the one false positive had an S
value of 0.93 or thereabouts.  It was a message to a bogofilter mailing
list and looked extremely innocuous except for a couple of artefacts in
the training database (words that ought to have been seen in many
messages but actually showed up predominantly in spam).  Ironically,
one of these was "fetchmail-5.9.14" -- I'd been adding a lot of spam
to the corpus since that version came out :)  Shows how easy it is to
bias this kind of classification; you really need to feed it similar
numbers of similarly-dated messages in order to get good training, and
shortcuts can be dangerous...
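
To make the bias concrete, here is a rough sketch of how a token's
spam probability shifts when the two corpora cover different date
ranges.  It uses a simplified form of the per-token estimate from
Graham's "A Plan for Spam" (good counts doubled, result clamped to
0.01..0.99), which may not match what bogofilter 0.7 actually does.
The function name and all of the counts are made up for illustration;
they are not figures from my test.

```python
def graham_token_prob(good_count, bad_count, n_good, n_bad):
    """Simplified Graham-style estimate that a token indicates spam.

    good_count / bad_count: occurrences of the token in ham / spam.
    n_good / n_bad: number of ham / spam messages in the corpus.
    """
    g = min(1.0, 2.0 * good_count / n_good)  # ham counts are doubled
    b = min(1.0, bad_count / n_bad)
    if g + b == 0:
        return 0.4  # value Graham's essay assigns to unseen tokens
    return max(0.01, min(0.99, b / (g + b)))


# Hypothetical token that appears in every message received after a
# given date, ham or spam alike (think "fetchmail-5.9.14").

# Balanced training: 200 of 1000 ham and 200 of 1000 spam are recent.
balanced = graham_token_prob(200, 200, 1000, 1000)

# Skewed training: only 10 of 1000 ham are recent, but 400 of 1000
# spam are, because spam kept being added after that date.
skewed = graham_token_prob(10, 400, 1000, 1000)

print(f"balanced corpus: {balanced:.3f}")
print(f"skewed corpus:   {skewed:.3f}")
```

With these invented counts the same neutral token goes from leaning
towards ham to being a strong spam indicator, purely because of when
the training messages were collected.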

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |