[Greg Louis, reporting on an experiment using f(w) in a modified bogofilter]
... Yup. In the test I reported earlier, the one false positive had an S value of 0.93 or thereabouts. It was a message to a bogofilter mailing list, and it looked extremely innocuous except for a couple of artefacts in the training database (words that ought to have been seen in many messages but actually showed up predominantly in spam). Ironically, one of these was "fetchmail-5.9.14" -- I'd been adding a lot of spam to the corpus since that version came out :) Shows how easy it is to bias this kind of classification; you really need to feed it similar numbers of similarly-dated messages in order to get good training, and shortcuts can be dangerous...
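[For readers who haven't seen the scheme under discussion, here is a minimal sketch of Gary Robinson's f(w) and the S combining rule as described in his writeup -- not bogofilter's actual code. The parameter names s (strength) and x (assumed prior), the corpus sizes, and the toy counts below are illustrative stand-ins.]

    from math import prod

    def f(spam_count, ham_count, nspam, nham, s=1.0, x=0.5):
        """Robinson's f(w): the raw per-word spam probability p(w),
        shrunk toward the prior x when the word has been seen only
        a few (n) times."""
        spamratio = spam_count / nspam
        hamratio = ham_count / nham
        p = spamratio / (spamratio + hamratio)
        n = spam_count + ham_count
        return (s * x + n * p) / (s + n)

    def S(fws):
        """Robinson's combined indicator from the per-word f(w) values:
        S = (1 + (P - Q)/(P + Q)) / 2, near 1 for spam, near 0 for ham."""
        n = len(fws)
        P = 1.0 - prod((1.0 - fw) for fw in fws) ** (1.0 / n)
        Q = 1.0 - prod(fws) ** (1.0 / n)
        return (1.0 + (P - Q) / (P + Q)) / 2.0

    # A word seen in 40 training spams and no hams -- like the
    # "fetchmail-5.9.14" artefact -- gets f(w) close to 1 and can
    # drag an otherwise innocuous message's S upward on its own:
    print(f(40, 0, nspam=1000, nham=1000))   # ~0.988

[The point of the example: a training-corpus artefact needs only lopsided counts, not any real spam content, to become a strong clue.]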
They sure can be -- many of us here learned this the hard way too. It was so bad for me at first (mixed-source data) that I ignored headers entirely; but that forced us to work harder on other things, and that appears to have given us some huge wins.

One thing that really helped me: our database has an integer "kill count" attached to each word. Whenever a new message is scored, the words that survive to the end of the Graham scheme (the "most extreme" words in the message) each get their killcount bumped by 1. Whenever I've gotten great results for bogus reasons, looking at the words with the highest killcounts has instantly pinpointed the cause (for example, when I started tokenizing "To" headers, "bruceg" became one of the two most frequent killer clues in the whole database); a short sketch follows below.

This kind of thing is easy for us to do so long as we're doing research; I imagine Eric might balk at the additional database burden in bogofilter, but he shouldn't <wink>. The killcount is much less revealing under the S ("f(w)") scheme, btw -- it tends then to reveal merely which words appear most often.
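[A sketch of the killcount idea in Python, under stated assumptions: Graham's original scheme keeps the 15 tokens whose probabilities lie farthest from 0.5, and the names score and spamprob here are stand-ins for whatever scoring entry point and per-token probability lookup the real classifier uses -- neither comes from the actual code.]

    from collections import Counter

    MAX_DISCRIMINATORS = 15    # Graham keeps the 15 most extreme words
    killcount = Counter()      # token -> times it survived to the final cut

    def score(tokens, spamprob):
        """Score a message's unique tokens; spamprob(t) is the estimated
        probability that a message containing t is spam."""
        # The "most extreme" words: probabilities farthest from 0.5.
        extremes = sorted(tokens,
                          key=lambda t: abs(spamprob(t) - 0.5),
                          reverse=True)[:MAX_DISCRIMINATORS]
        for t in extremes:
            killcount[t] += 1  # this word helped decide a message
        # Combine the survivors Graham-style (naive Bayes, both ways).
        p = q = 1.0
        for t in extremes:
            p *= spamprob(t)
            q *= 1.0 - spamprob(t)
        return p / (p + q)

    # After scoring a batch, the biggest killers expose training bias:
    # for token, n in killcount.most_common(20): print(n, token)

[When results look suspiciously good, the top of killcount.most_common() is exactly where clues like "bruceg" surface.]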