On 20020919 (Thu) at 1035:04 -0400, Gary Robinson wrote:
At present, it appears that S may perform better than relatively untweaked Graham (as I guess bogofilter is using).
Yes, bogofilter 0.7 (haven't looked at the latest release yet) is using untweaked Graham.
1) emails in the middle could be classified as "you should manually check these", or
(1) is attractive, but one thing worries me: it may give people a false sense of security -- they may think they only have to look at the middle ones, but occasionally even a message that seems an extreme case of spam might be legit.
Yup. In the test I reported earlier, the one false positive had an S value of 0.93 or thereabouts. It was a message to a bogofilter mailing list and looked extremely innocuous except for a couple of artefacts in the training database (words that ought to have been seen in many messages but actually showed up predominantly in spam). Ironically, one of these was "fetchmail-5.9.14" -- I'd been adding a lot of spam to the corpus since that version came out :) Shows how easy it is to bias this kind of classification; you really need to feed it similar numbers of similarly-dated messages in order to get good training, and shortcuts can be dangerous...

-- 
| G r e g  L o u i s        | gpg public key:    |
| http://www.bgl.nu/~glouis | finger greg@bgl.nu |
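For illustration only, here is a rough sketch of the three-way split proposed in (1), assuming S is a score in [0, 1] where higher means more spam-like. The function name `classify` and both cutoff values are hypothetical; they are not taken from bogofilter or from the tests discussed in this thread.

```python
# Illustrative three-way split on a spam score S in [0, 1].
# HAM_CUTOFF and SPAM_CUTOFF are hypothetical values chosen for the example,
# not settings from bogofilter or from the tests described above.
HAM_CUTOFF = 0.4
SPAM_CUTOFF = 0.6


def classify(s, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Return 'ham', 'unsure' or 'spam' for a score s in [0, 1]."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if s >= spam_cutoff:
        return "spam"
    if s <= ham_cutoff:
        return "ham"
    # Everything strictly between the cutoffs is left for manual checking.
    return "unsure"


for s in (0.10, 0.50, 0.93):
    print(f"S = {s:.2f} -> {classify(s)}")
```

With cutoffs like these, a legitimate message scoring around 0.93 is filed as spam and never reaches the manual-check bucket, which is the false sense of security mentioned above.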