Bayesian kids content filtering in Python?
John J. Lee
jjl at pobox.com
Fri Aug 29 20:16:14 CEST 2003
"Paul Paterson" <paulpaterson at users.sourceforge.net> writes:
> "Gregory (Grisha) Trubetskoy" <grisha at ispol.com> wrote in message
> news:20030828161409.V40715 at onyx.ispol.com...
> > I've been snooping around the web for open source kids filtering software.
> > Something that runs as an http proxy on my home firewall and blocks
> > certain pages based on content.
> > It occurred to me that this might be an interesting project to be done in
> > Python, probably using the same training and scoring mechanism that
> > spambayes uses.
> > Anyway - I wonder if anyone has already tried something like this?
> As Rene points out in his response, after some great advice and discussion
> from Skip I gave this a try. It works very well. I added a module to a proxy
> of concept but I have copied the code I ended up with below. It certainly
> seems to work and has application both for parental filtering and other
> kinds of content management.
This same idea occurred to me a while ago, but there is one obvious
problem: email filters and the category of web filters that you're
talking about have rather different problems to solve. Email filters
are usually designed to work in the user's interests and act on
content that is sent to the user by others; false negatives are not
very important. Web filters are often designed to work against the
user's interests and act on material that is actively retrieved by the
user (and so might change completely from training time to use time);
false negatives are important.
It will no doubt work well for situations where you want to, for
example, block pop-ups, advertising, and other stuff that one tends to
bump into whilst going about one's normal surfing business. But if
somebody (children, employees, and other people not to be trusted ;-)
is actually trying to work around your barriers, there are always
likely to be false negatives: sites of a flavour that you've never
seen before that you'd wish would trigger your defences, but won't.
If the filter has never even seen that *kind* of page before, it can't
be expected to work. Unfortunately (or fortunately, depending on the
case at hand), there are many kinds of pages that people want to
censor, and you're not going to block them all. It may work well most
of the time, but is that enough? What's needed here, perhaps, is an
open effort to train on categories of things that people would like to
block. That might be enough, since I suppose *most* things you're
trying to block, in the case of kids, are not actually targeted at
them, so arms races are not likely to develop.
This in no way invalidates the idea, of course -- it just limits it.
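
To make the unseen-category point concrete, here is a minimal sketch of a
toy two-class naive Bayes scorer. It is not the spambayes algorithm, only an
illustration under stated assumptions: the class name TinyBayes, the labels
"block" and "allow", the add-one smoothing, the equal priors, and the
training phrases are all hypothetical. A page made up entirely of words the
filter never saw in training gets a neutral score, so it slips through as a
false negative.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes over word tokens (illustrative, not spambayes)."""

    def __init__(self):
        # Number of training pages of each label that contained each token.
        self.counts = {"block": Counter(), "allow": Counter()}
        # Number of training pages seen for each label.
        self.docs = {"block": 0, "allow": 0}

    def train(self, text, label):
        # Count each token once per page (presence, not frequency).
        self.counts[label].update(set(text.lower().split()))
        self.docs[label] += 1

    def score(self, text):
        """Return an estimate of P(block | text), assuming equal priors."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            b = self.counts["block"][tok]
            a = self.counts["allow"][tok]
            if a + b == 0:
                continue  # never seen in training: no evidence either way
            # Add-one smoothed estimates of P(token | label).
            p_block = (b + 1) / (self.docs["block"] + 2)
            p_allow = (a + 1) / (self.docs["allow"] + 2)
            log_odds += math.log(p_block / p_allow)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayes()
f.train("casino poker jackpot bets", "block")
f.train("poker bets odds casino", "block")
f.train("homework maths science library", "allow")
f.train("science library dinosaurs homework", "allow")

known = f.score("casino poker tonight")    # resembles the training data
benign = f.score("homework science")       # resembles allowed pages
unseen = f.score("graphic gore videos")    # a kind of page never trained on

print(round(known, 2), round(benign, 2), round(unseen, 2))
```

With this toy data the first page scores well above 0.5, the second well
below, and the third sits at exactly 0.5: the filter has no evidence at all,
which is the situation described above when the user goes looking for
material of a flavour the filter was never trained on.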