[Spambayes] Advantages of S?
Greg Louis
glouis@dynamicro.on.ca
Thu, 19 Sep 2002 10:53:15 -0400
On 20020919 (Thu) at 1035:04 -0400, Gary Robinson wrote:
> At present, it appears that S may perform better than relatively untweaked
> graham (as I guess bogofilter is using)
Yes, bogofilter 0.7 (haven't looked at the latest release yet) is using
untweaked Graham.
> 1) emails in the middle could be classified as "you should manually check
> these", or
>
> (1) is attractive, but one thing worries me: it may give people a false
> sense of security -- they may think they only have to look at the middle
> ones, but occasionally even extreme cases of seeming to be spam might be
> legit.
Yup. In the test I reported earlier, the one false positive had an S
value of 0.93 or thereabouts. It was a message to a bogofilter mailing
list and looked extremely innocuous except for a couple of artefacts in
the training database (words that ought to have been seen in many
messages but actually showed up predominantly in spam). Ironically,
one of these was "fetchmail-5.9.14" -- I'd been adding a lot of spam
to the corpus since that version came out :) Shows how easy it is to
bias this kind of classification; you really need to feed it similar
numbers of similarly-dated messages in order to get good training, and
shortcuts can be dangerous...
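
To illustrate the bias, here is a rough sketch in Python. It uses a
simplified Graham-style per-token estimate (no doubling of good counts,
no clamping), not bogofilter's actual code, and every count in it is
made up for illustration. The function name token_spam_prob is
hypothetical.

```python
def token_spam_prob(spam_hits, n_spam, ham_hits, n_ham):
    """Simplified Graham-style estimate for one token (an assumption,
    not bogofilter's real formula): the token's relative frequency in
    spam divided by the sum of its relative frequencies in spam and ham."""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical counts for a header token such as "fetchmail-5.9.14",
# which appears in every message fetched after that version was installed.

# Skewed corpus: 400 of 1000 spam messages are recent and carry the token,
# but only 20 of 1000 ham messages are recent.
skewed = token_spam_prob(spam_hits=400, n_spam=1000, ham_hits=20, n_ham=1000)

# Balanced corpus: the same share of recent messages on both sides.
balanced = token_spam_prob(spam_hits=400, n_spam=1000, ham_hits=400, n_ham=1000)

print(f"skewed corpus:   {skewed:.3f}")
print(f"balanced corpus: {balanced:.3f}")
```

With the skewed counts the token looks strongly spammy even though it
says nothing about content; with similarly-dated messages on both sides
it comes out neutral.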
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg@bgl.nu |