Spambayes + HTTP proxy server

Sun Feb 2 01:24:38 EST 2003

jerf at compy.attbi.com wrote:

> On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
>
> >Does anyone have any experience in this area to say whether this
> >approach is workable?
>
>
> Perfectly workable, though it would probably require some tweaks to the
> tokenizer to work as well as possible.
>
> It would not take long to set up at least a prototype of this.
>
The prototype turned out to be shorter than my original post:

#
# mod_spambayesfilter.py - used by proxy3
#
# BufferSomeFilter and register_filter are assumed to be supplied by proxy3.
from spambayes import tokenizer, classifier

class SpamBayesFilter(BufferSomeFilter):
     BUFFER_LEN = 128
     LOWER_BOUND = 0.5

     tok = tokenizer.Tokenizer()
     checker = classifier.Classifier()

     def filter(self, s):
         # Score the buffered text; block the page if it looks like spam.
         tokens = self.tok.tokenize(s)
         if self.checker.chi2_spamprob(tokens) > self.LOWER_BOUND:
             return "Not authorized"
         else:
             return s

register_filter('*/*', 'text/html', SpamBayesFilter)

Am I right in thinking that the spambayes tokenizer will just revert to
splitting up words if it doesn't think it is looking at an email?
Perhaps this might be sufficient for webpage filtering, since web pages
probably won't be using the same kinds of subterfuge that spammers resort to.
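
For illustration, here is a rough sketch of what "just splitting up words" could look like for a web page. It is not the spambayes tokenizer and makes no claim about how spambayes behaves on non-email input; TextExtractor and split_words are hypothetical names, and the sketch uses only the standard library's html.parser to drop the tags before splitting on whitespace.

```python
# Hypothetical sketch: plain word splitting for a web page.
# This is NOT the spambayes tokenizer, only an illustration of the idea.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def split_words(html):
    """Yield lowercased, whitespace-separated words from the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    for chunk in parser.chunks:
        for word in chunk.split():
            yield word.lower()


page = "<html><body><h1>Cheap Pills</h1><p>Buy now!</p></body></html>"
print(list(split_words(page)))
```

A generator like this could be handed to a classifier as its word stream. Note that it keeps punctuation attached to words and would also pick up the contents of script and style elements, so a real filter would likely want more care than this.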