Bayesian kids content filtering in Python?

Fri Aug 29 00:47:28 EDT 2003

"Gregory (Grisha) Trubetskoy" <grisha at ispol.com> wrote in message
news:20030828161409.V40715 at onyx.ispol.com...
>
> I've been snooping around the web for open source kids filtering software.
> Something that runs as an http proxy on my home firewall and blocks
> certain pages based on content.
>
> It occurred to me that this might be an interesting project to be done in
> Python, probably using the same training and scoring mechanism that
> spambayes uses.
>
> Anyway - I wonder if anyone has already tried something like this?

As Rene points out in his response, after some great advice and discussion
from Skip I gave this a try. It works very well. I added a module to a proxy
server (http://theory.stanford.edu/~amitp/proxy.html) and then 'trained'
Spambayes on top of it by visiting sites that I wanted to allow (news sites)
and then ones I wanted to block (sports sites - just to test!). After a
relatively short training period (20-40 sites/pages) it started to pick up
the characteristics of positive and negative sites. It was then easy to get
it to block the negative sites. Although there were still quite a few false
positives, I imagine that with a wider training set it would have been very
accurate (based on the reported accuracy of Spambayes).
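
For anyone who wants to see the train-then-score idea in isolation, here is a
minimal sketch. It is NOT the Spambayes algorithm (Spambayes combines token
probabilities differently); it is plain naive Bayes with Laplace smoothing and
equal priors, written for Python 3, and the names TinyPageClassifier and
tokenize are made up for illustration.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]{3,}")


def tokenize(text):
    """Lowercase the text and return the set of distinct word tokens."""
    return set(TOKEN_RE.findall(text.lower()))


class TinyPageClassifier:
    """Toy naive Bayes over allowed ("ham") and blocked ("spam") pages."""

    def __init__(self):
        self.token_counts = {"ham": Counter(), "spam": Counter()}
        self.page_counts = {"ham": 0, "spam": 0}

    def train(self, text, is_spam):
        """Record which tokens appeared on one allowed or blocked page."""
        label = "spam" if is_spam else "ham"
        self.page_counts[label] += 1
        self.token_counts[label].update(tokenize(text))

    def score(self, text):
        """Return a value in (0, 1); higher means more like blocked pages."""
        log_odds = 0.0
        for token in tokenize(text):
            # Laplace smoothing keeps unseen tokens from producing log(0).
            p_spam = ((self.token_counts["spam"][token] + 1)
                      / (self.page_counts["spam"] + 2))
            p_ham = ((self.token_counts["ham"][token] + 1)
                     / (self.page_counts["ham"] + 2))
            log_odds += math.log(p_spam) - math.log(p_ham)
        # Clamp so that math.exp cannot overflow on very long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


clf = TinyPageClassifier()
clf.train("election results parliament budget vote", is_spam=False)
clf.train("economy markets inflation budget report", is_spam=False)
clf.train("football league goal striker match", is_spam=True)
clf.train("cricket match innings wicket score", is_spam=True)

print(clf.score("late goal wins the football match"))  # above 0.5
print(clf.score("parliament passes the budget"))       # below 0.5
```

With only four training pages the scores already lean the right way, which
matches what I saw with the real thing after a few dozen pages.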

Unfortunately, I didn't carry the work through much beyond the initial proof
of concept, but I have copied the code I ended up with below. It certainly
seems to work and has applications both for parental filtering and for other
kinds of content management.

Paul

----
[mod_spambayes.py  - place in your proxy folder and amend the proxy.conf
file to point to this module]

print "Importing Spambayes filter"

import os

from proxy3_filter import *
import proxy3_options

#
# Find ham/spam database and suitable folders for archiving
from spambayes import hammie, Options, mboxutils
dbf = os.path.expanduser(
    Options.options.hammiefilter_persistent_storage_file)
#
hamfolder = os.path.join(os.path.split(dbf)[0], ".hampages")
spamfolder = os.path.join(os.path.split(dbf)[0], ".spampages")

import time

def getTempName(ham, spam, ok):
    """Return a temporary file name to archive this file"""
    if ok:
        direc = ham
    else:
        direc = spam
    return os.path.join(direc, "arc%d" % time.time())

print "Using db file: %s\nham folder: %s\nspam folder: %s" % (
    dbf, hamfolder, spamfolder)

class SpambayesFilter(BufferAllFilter):
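    # Opened read-only ('r'); training (am_learning = 1) presumably needs a
    # writable mode such as 'c' so that store() can save the new counts.
    hammie = hammie.open(dbf, 1, 'r')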

    am_learning = 0 # set to 1 when learning
    is_ok = 0 # set to 1 if visited page is ok for viewing
    prevent_access = 0 # set to 1 to block access to dubious pages
    archive_files = 0 # set to 1 to archive files for later training

    def filter(self, s):
        if self.reply.split()[1] == '200':
            msg = "%s\r\n%s" % (self.serverheaders, s)
            if self.am_learning:
                old = self.hammie.score(msg)
                self.hammie.train(msg, not self.is_ok)
                new = self.hammie.score(msg)
                self.hammie.store()
                print "Learned! was=%.5f, now=%.5f" % (old, new)
                if self.archive_files:
                    temp_name = getTempName(hamfolder, spamfolder,
                                            self.is_ok)
                    print "Writing %s" % temp_name
                    # Open outside the try block so that f is always bound
                    # when the finally clause runs.
                    f = open(temp_name, "w")
                    try:
                        f.write(msg)
                    finally:
                        f.close()
            else:
                prob = self.hammie.score(msg)
                print "|  prob: %.5f" % prob
                if (prob >= Options.options.spam_cutoff and
                        self.prevent_access):
                    print self.serverheaders
                    print "text:", s[0:40], "...", s[-40:]
                    return "not authorized"
        return s

from proxy3_util import *

register_filter('*/*', 'text/html', SpambayesFilter)

print "Spambayes filter installed!"
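
----
If you want to try out the branching in filter() without running the proxy,
something like the following sketch may help. It mirrors the three paths
(non-200 reply, learning, scoring/blocking) as a plain function in Python 3.
StubClassifier and filter_page are hypothetical names, not part of the proxy
or of Spambayes, and the 0.9 default cutoff is only a placeholder for whatever
Options.options.spam_cutoff is set to.

```python
class StubClassifier:
    """Hypothetical stand-in with the score/train calls used by the module."""

    def __init__(self, fixed_score):
        self.fixed_score = fixed_score
        self.trained = []          # (msg, is_spam) pairs seen by train()

    def score(self, msg):
        return self.fixed_score

    def train(self, msg, is_spam):
        self.trained.append((msg, is_spam))


def filter_page(classifier, status, headers, body, *, learning=False,
                page_is_ok=False, prevent_access=False, spam_cutoff=0.9):
    """Return the body to send to the browser, following filter()'s branches."""
    if status != "200":
        return body                # only successful replies are examined
    msg = "%s\r\n%s" % (headers, body)
    if learning:
        # Pages marked as not ok are trained as spam, as in the module.
        classifier.train(msg, not page_is_ok)
        return body                # training never blocks the page
    if prevent_access and classifier.score(msg) >= spam_cutoff:
        return "not authorized"
    return body


page = "<html>match report</html>"
print(filter_page(StubClassifier(0.99), "200", "Content-Type: text/html",
                  page, prevent_access=True))
print(filter_page(StubClassifier(0.10), "200", "Content-Type: text/html",
                  page, prevent_access=True))
```

The first call prints the "not authorized" replacement and the second prints
the page unchanged, since its stub score is below the cutoff.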