Bayesian kids content filtering in Python?

Paul Paterson paulpaterson at
Fri Aug 29 06:47:28 CEST 2003

"Gregory (Grisha) Trubetskoy" <grisha at> wrote in message
news:20030828161409.V40715 at
> I've been snooping around the web for open source kids filtering software.
> Something that runs as an http proxy on my home firewall and blocks
> certain pages based on content.
> It occurred to me that this might be an interesting project to be done in
> Python, probably using the same training and scoring mechanism that
> spambayes uses.
> Anyway - I wonder if anyone has already tried something like this?

As Rene points out in his response, I gave this a try after some great advice
and discussion from Skip. It works very well. I added a module to a proxy
server ( and then 'trained'
Spambayes on top of it by going to sites that I wanted to allow (news sites)
and then ones I wanted to block (sports sites - just to test!). After a
relatively short training period (20-40 sites/pages) it started to pick up
the characteristics of positive and negative sites. It was then easy to get
it to block the negative sites. Although there were still quite a few false
positives, I imagine that with a wider training set it would have been very
accurate (based on the reported accuracy of Spambayes).
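To make the train-then-score idea concrete, here is a minimal sketch of a word-based Bayesian page scorer. It is a plain naive Bayes toy with add-one smoothing, not the combining scheme Spambayes actually uses, and the class name ToyPageClassifier, its methods, and the sample texts are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


class ToyPageClassifier:
    """Tiny naive Bayes scorer; a stand-in for Spambayes, not its algorithm."""

    def __init__(self):
        # Keyed by is_blocked: how many trained pages contained each word.
        self.word_counts = {True: Counter(), False: Counter()}
        self.page_counts = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        # Lowercase word tokens; each distinct word counts once per page.
        return set(re.findall(r"[a-z']+", text.lower()))

    def train(self, text, is_blocked):
        self.page_counts[is_blocked] += 1
        self.word_counts[is_blocked].update(self.tokenize(text))

    def score(self, text):
        """Return a value in (0, 1); higher means more like blocked pages."""
        n_blocked = self.page_counts[True]
        n_allowed = self.page_counts[False]
        # Prior log-odds, smoothed so an untrained class never yields log(0).
        log_odds = math.log((n_blocked + 1) / (n_allowed + 1))
        for word in self.tokenize(text):
            p_blocked = (self.word_counts[True][word] + 1) / (n_blocked + 2)
            p_allowed = (self.word_counts[False][word] + 1) / (n_allowed + 2)
            log_odds += math.log(p_blocked / p_allowed)
        # Clamp before exponentiating to avoid overflow on long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))


clf = ToyPageClassifier()
clf.train("election parliament budget vote minister", is_blocked=False)
clf.train("economy markets inflation budget report", is_blocked=False)
clf.train("football match goal league striker", is_blocked=True)
clf.train("tennis match final score champion", is_blocked=True)

sports = clf.score("league match goal tonight")        # leans blocked
news = clf.score("minister presents budget report")    # leans allowed
```

With only four training pages the scores already separate, which matches the experience above that a short training period is enough to see an effect. A real filter would need far more pages, and better tokenizing of HTML, before its scores could be trusted.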

Unfortunately, I didn't carry the work much beyond the initial proof of
concept, but I have copied the code I ended up with below. It certainly
seems to work and has application both for parental filtering and other
kinds of content management.


[  - place in your proxy folder and amend the proxy.conf
file to point to this module]

print "Importing Spambayes filter"

import os

from proxy3_filter import *
import proxy3_options

# Find ham/spam database and suitable folders for archiving
from spambayes import hammie, Options, mboxutils
dbf =
hamfolder = os.path.join(os.path.split(dbf)[0], ".hampages")
spamfolder = os.path.join(os.path.split(dbf)[0], ".spampages")

import time

def getTempName(ham, spam, ok):
    """Return a temporary file name to archive this file"""
    if ok:
        direc = ham
    else:
        direc = spam
    return os.path.join(direc, "arc%d" % time.time())

print "Using db file: %s\nham folder: %s\nspam folder: %s" % (dbf,
hamfolder, spamfolder)

class SpambayesFilter(BufferAllFilter):
    # Assumption: the classifier is opened read-only from the db file above.
    hammie = hammie.open(dbf, 1, 'r')

    am_learning = 0 # set to 1 when learning
    is_ok = 0 # set to 1 if visited page is ok for viewing
    prevent_access = 0 # set to 1 to block access to dubious pages
    archive_files = 0 # set to 1 to archive files for later training

    def filter(self, s):
        if self.reply.split()[1] == '200':
            msg = "%s\r\n%s" % (self.serverheaders, s)
            if self.am_learning:
                old = self.hammie.score(msg)
                self.hammie.train(msg, not self.is_ok)
                new = self.hammie.score(msg)
                print "Learned! was=%.5f, now=%.5f" % (old, new)
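                if self.archive_files:
                    # Assumption: is_ok picks the ham or spam archive folder.
                    temp_name = getTempName(hamfolder, spamfolder,
                                            self.is_ok)
                    print "Writing %s" % temp_name
                    f = open(temp_name, "w")
                    # Assumption: archive the same text that was trained on.
                    f.write(msg)
                    f.close()
            else:
                prob = self.hammie.score(msg)
                print "|  prob: %.5f" % prob
                # Assumption: block only when prevent_access is switched on.
                if prob >= Options.options.spam_cutoff and self.prevent_access:
                    print self.serverheaders
                    print "text:", s[0:40], "...", s[-40:]
                    return "not authorized"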
        return s

from proxy3_util import *

register_filter('*/*', 'text/html', SpambayesFilter)

print "Spambayes filter installed!"
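
For anyone who wants to try the decision logic without a proxy or Spambayes installed, the following is a rough standalone sketch of the control flow the filter above appears to follow: only "200" replies are examined, learning mode trains and passes the page through, and otherwise a score at or above the cutoff replaces the page. The names filter_page and KeywordScorer, and the 0.9 cutoff, are hypothetical; any object with train(text, is_spam) and score(text) methods could be used instead.

```python
BLOCKED_BODY = "not authorized"


def filter_page(status_line, headers, body, scorer, cutoff=0.9,
                learning=False, page_is_ok=False, prevent_access=True):
    """Return the body that should be sent on to the browser."""
    parts = status_line.split()
    if len(parts) < 2 or parts[1] != "200":
        return body  # only successful replies are examined
    msg = "%s\r\n%s" % (headers, body)
    if learning:
        # Second argument says whether the page is "spam" (to be blocked).
        scorer.train(msg, not page_is_ok)
        return body
    if prevent_access and scorer.score(msg) >= cutoff:
        return BLOCKED_BODY
    return body


class KeywordScorer:
    """Hypothetical stand-in scorer so the sketch runs on its own."""

    def __init__(self):
        self.bad_words = set()

    def train(self, msg, is_spam):
        if is_spam:
            self.bad_words.update(msg.lower().split())

    def score(self, msg):
        # Fraction of words previously seen on pages marked for blocking.
        words = msg.lower().split()
        if not words:
            return 0.0
        hits = sum(1 for word in words if word in self.bad_words)
        return hits / len(words)


scorer = KeywordScorer()
ok_status = "HTTP/1.0 200 OK"
filter_page(ok_status, "", "football league goal", scorer,
            learning=True, page_is_ok=False)

blocked = filter_page(ok_status, "", "football goal", scorer)
allowed = filter_page(ok_status, "", "budget vote minister", scorer)
passthrough = filter_page("HTTP/1.0 404 Not Found", "", "football goal", scorer)
```

Keeping the decision in a plain function like this makes it easy to test cutoff values against archived pages before wiring anything into the proxy.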
