Bayesian kids content filtering in Python?

Tue Sep 2 10:41:57 EDT 2003

I've been looking at this sort of thing for a while now.  I initially
tried what you did about a year and a half ago and ran into some
problems.  I ended up writing my own HTTP proxy, HTML parser, and
Bayes filtering code.  It is being used now by Woodland Hills School
District in Pittsburgh, PA.  The school has 2300 computers, 6000
students, and 3 T1 lines to the Internet.  Most of the problems I ran
into were scalability problems, but there were also a fair number of
problems dealing with broken HTTP servers (sending only \r or \n, or
nothing, or not sending all of the headers, or sending wrong headers,
etc.) and goofy problems with all the different versions of IE.  I
think spambayes
would probably be better than the Bayes filter I wrote, and I do plan
to try it in the future (it has improved significantly over the last
year and a half).  I have had very good success with what I have,
though.  It is significantly better than the proprietary (and
expensive) filters that are out there on all counts (fewer false
positives, far fewer false negatives, faster, etc.).  The code and
docs (and some marketing :-) are at
http://www.digitallumber.com/willow.  I'd appreciate some feedback
from experienced Python developers.
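
To give an idea of the broken-server problem, here is a minimal sketch
of a response-header parser that tolerates \r\n, bare \r, or bare \n
line endings and skips malformed header lines.  This is not the code
used in willow; the function name parse_response_head and its return
shape are made up for illustration, and it assumes the whole response
head is already in memory.

```python
import re

# Any single line break: CRLF, bare CR, or bare LF.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")

# A blank line (two consecutive line breaks) ends the header block.
# The (?!\n) keeps a lone CRLF from being read as CR followed by LF.
_HEADER_END = re.compile(r"(?:\r\n|\r(?!\n)|\n)(?:\r\n|\r|\n)")


def parse_response_head(raw):
    """Split raw response bytes into (status_line, headers, body).

    Hypothetical helper: headers is a dict keyed by lower-cased name,
    and lines without a colon are silently ignored.
    """
    # latin-1 maps every byte to one character, so the body round-trips.
    text = raw.decode("latin-1")
    match = _HEADER_END.search(text)
    if match:
        head, body = text[:match.start()], text[match.end():]
    else:
        # The server never sent a blank line: treat it all as headers.
        head, body = text, ""

    lines = _LINE_BREAK.split(head)
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if ":" not in line:
            continue  # malformed header line, skip it
        name, value = line.split(":", 1)
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body.encode("latin-1")
```

For example, parse_response_head(b"HTTP/1.0 200 OK\nContent-Type:
text/html\n\n<html>") gives the same status line, headers, and body as
the same response sent with proper \r\n line endings.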

"Paul Paterson" <paulpaterson at users.sourceforge.net> wrote in message news:<AVA3b.9197$Pn6.1974 at twister.austin.rr.com>...
> "Gregory (Grisha) Trubetskoy" <grisha at ispol.com> wrote in message
> news:20030828161409.V40715 at onyx.ispol.com...
> >
> > I've been snooping around the web for open source kids filtering software.
> > Something that runs as an http proxy on my home firewall and blocks
> > certain pages based on content.
> >
> > It occurred to me that this might be an interesting project to be done in
> > Python, probably using the same training and scoring mechanism that
> > spambayes uses.
> >
> > Anyway - I wonder if anyone has already tried something like this?
> 
> As Rene points out in his response, after some great advice and discussion
> from Skip I gave this a try. It works very well. I added a module to a proxy
> server (http://theory.stanford.edu/~amitp/proxy.html) and then 'trained'
> Spambayes on top of it by going to sites that I wanted to allow (news sites)
> and then ones I wanted to block (sports sites - just to test!). After a
> relatively short training period (20-40 sites/pages) it started to pick up
> the characteristics of positive and negative sites. It was then easy to get
> it to block the negative sites. Although there were still quite a few false
> positives, I imagine that with a wider training suite it would have been very
> accurate (based on the reported accuracy of Spambayes).
> 
> Unfortunately, I didn't carry the work through much beyond the initial proof
> of concept but I have copied the code I ended up with below. It certainly
> seems to work and has application both for parental filtering and other
> kinds of content management.
> 
> Paul
> 
>