Bayesian kids content filtering in Python?
paulpaterson at users.sourceforge.net
Fri Aug 29 06:47:28 CEST 2003
"Gregory (Grisha) Trubetskoy" <grisha at ispol.com> wrote in message
news:20030828161409.V40715 at onyx.ispol.com...
> I've been snooping around the web for open source kids filtering software.
> Something that runs as an http proxy on my home firewall and blocks
> certain pages based on content.
> It occurred to me that this might be an interesting project to be done in
> Python, probably using the same training and scoring mechanism that
> spambayes uses.
> Anyway - I wonder if anyone has already tried something like this?
As Rene points out in his response, after some great advice and discussion
from Skip I gave this a try. It works very well. I added a module to a proxy
server (http://theory.stanford.edu/~amitp/proxy.html) and then 'trained'
Spambayes on top of it by going to sites that I wanted to allow (news sites)
and then ones I wanted to block (sports sites - just to test!). After a
relatively short training period (20-40 sites/pages) it started to pick up
the characteristics of positive and negative sites. It was then easy to get
it to block the negative sites. Although there were still quite a few false
positives I imagine that with a wider training suite it would have been very
accurate (based on the reported accuracy of Spambayes).
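
To make the train-then-score idea concrete, here is a minimal sketch of a two-class word-frequency classifier. It is not Spambayes and does not use its token combining scheme; it is a plain naive Bayes stand-in, and the names PageClassifier, tokenize, train and score are hypothetical, chosen only for this illustration. Class priors are ignored for simplicity.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]{3,}")


def tokenize(page_text):
    """Lowercase the page and return its set of distinct word tokens."""
    return set(TOKEN_RE.findall(page_text.lower()))


class PageClassifier:
    """Tiny naive Bayes scorer (illustrative stand-in, not Spambayes)."""

    def __init__(self):
        # Number of trained pages per label that contained each token.
        self.token_counts = {"allow": Counter(), "block": Counter()}
        self.page_counts = {"allow": 0, "block": 0}

    def train(self, page_text, blocked):
        """Record one visited page as an allowed or a blocked example."""
        label = "block" if blocked else "allow"
        self.page_counts[label] += 1
        self.token_counts[label].update(tokenize(page_text))

    def score(self, page_text):
        """Return a score in (0, 1); higher means more like blocked pages."""
        log_odds = 0.0
        for token in tokenize(page_text):
            # Laplace smoothing keeps unseen tokens from zeroing the score.
            p_block = (self.token_counts["block"][token] + 1.0) / (
                self.page_counts["block"] + 2.0)
            p_allow = (self.token_counts["allow"][token] + 1.0) / (
                self.page_counts["allow"] + 2.0)
            log_odds += math.log(p_block) - math.log(p_allow)
        # Clamp so math.exp cannot overflow on very long pages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

With only a handful of training pages such a scorer already separates pages that share vocabulary with one group from those that share it with the other, which matches what I saw after 20-40 pages. An untrained classifier returns 0.5 for everything.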
Unfortunately, I didn't carry the work through much beyond the initial proof
of concept but I have copied the code I ended up with below. It certainly
seems to work and has application both for parental filtering and other
kinds of content management.
[mod_spambayes.py - place in your proxy folder and amend the proxy.conf
file to point to this module]
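print "Importing Spambayes filter"

import os
import time

from proxy3_filter import *

# Find ham/spam database and suitable folders for archiving
from spambayes import hammie, Options, mboxutils

# NOTE: the statement that set dbf (the path to the Spambayes database)
# is missing from the archived post and has to be supplied here.
hamfolder = os.path.join(os.path.split(dbf)[0], ".hampages")
spamfolder = os.path.join(os.path.split(dbf)[0], ".spampages")

def getTempName(ham, spam, ok):
    """Return a temporary file name to archive this file"""
    # The branch lines were lost in the archive; picking the folder
    # from the ok flag is the evident intent.
    if ok:
        direc = ham
    else:
        direc = spam
    return os.path.join(direc, "arc%d" % time.time())

print "Using db file: %s\nham folder: %s\nspam folder: %s" % (dbf,
    hamfolder, spamfolder)

# NOTE: the class statement is missing from the archived post. The class
# name comes from the register_filter call below; the base class is an
# assumption (a buffering filter class from proxy3_filter).
class SpambayesFilter(BufferAllFilter):

    hammie = hammie.open(dbf, 1, 'r')
    am_learning = 0     # set to 1 when learning
    is_ok = 0           # set to 1 if visited page is ok for viewing
    prevent_access = 0  # set to 1 to block access to dubious pages
    archive_files = 0   # set to 1 to archive files for later training

    def filter(self, s):
        # Only look at successful replies; the status code is the
        # second field of the reply line.
        if self.reply.split()[1] == '200':
            msg = "%s\r\n%s" % (self.serverheaders, s)
            # The conditions on the flags are reconstructed; the
            # original lines are missing from the archive.
            if self.am_learning:
                old = self.hammie.score(msg)
                # a page that is not ok is trained as spam
                self.hammie.train(msg, not self.is_ok)
                new = self.hammie.score(msg)
                print "Learned! was=%.5f, now=%.5f" % (old, new)
            if self.archive_files:
                temp_name = getTempName(hamfolder, spamfolder,
                                        self.is_ok)
                print "Writing %s" % temp_name
                f = open(temp_name, "w")
                f.write(msg)  # assumed; the write is missing too
                f.close()
            prob = self.hammie.score(msg)
            print "| prob: %.5f" % prob
            if prob >= Options.options.spam_cutoff and \
                    self.prevent_access:
                print "text:", s[0:40], "...", s[-40:]
                return "not authorized"
        return s

from proxy3_util import *
register_filter('*/*', 'text/html', SpambayesFilter)
print "Spambayes filter installed!"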
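
The blocking decision at the end of the filter can be looked at on its own. The sketch below isolates it as a plain function so it can be tried without the proxy or Spambayes installed. The name filter_page and its parameters are hypothetical, and the 0.9 default is only an assumed stand-in for whatever Options.options.spam_cutoff is set to.

```python
def filter_page(page, prob, cutoff=0.9, prevent_access=True,
                blocked_body="not authorized"):
    """Return the body the proxy should hand back to the browser.

    page           -- the original page text
    prob           -- classifier score for the page, between 0 and 1
    cutoff         -- assumed threshold; scores at or above it are blocked
    prevent_access -- when false, pages are scored but never blocked
    """
    if prevent_access and prob >= cutoff:
        return blocked_body
    return page
```

Keeping prevent_access switched off while training lets you watch the printed scores for a while before the filter starts replacing pages.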