Filtering web proxy

Neil Schemenauer nascheme at enme.ucalgary.ca
Tue Apr 18 00:17:50 CEST 2000


Erno Kuusela <erno at iki.fi> wrote:
>a html parser would need to work incrementally, unless you want to
>wait for the whole document to be transferred over the network before
>seeing any of it rendered.

Yes, and if your connection is fast enough that you don't need
incremental loading, you probably don't care too much about ads.
In my experience, filtering ads greatly improves browsing if
you're on a slow connection.

>i guess you could do it incrementally with sgmllib (iirc you feed it a
>file object?), but you run into the fact that a big part of
>the html documents on the web are malformed and rely on the
>error correcting heuristics of the major browsers to function...

Right, and there seems to be a lot of bad HTML code out there.
Unfortunately, I don't think you can easily make sgmllib parse
incrementally.  Someone please correct me if I'm wrong.
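
To make the incremental idea concrete, here is a minimal sketch of
feeding a parser one network-sized chunk at a time.  It assumes
html.parser.HTMLParser from the current Python standard library, not
sgmllib, so it says nothing about what sgmllib itself can be coaxed
into doing.  TagLogger and parse_in_chunks are made-up names for
illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_in_chunks(chunks):
    """Feed the parser piece by piece, as data would arrive from a socket."""
    parser = TagLogger()
    for chunk in chunks:
        # A tag split across two chunks is buffered until the rest arrives.
        parser.feed(chunk)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    pieces = ["<html><bo", "dy><p>hi<im", "g src='a.gif'></p>"]
    print(parse_in_chunks(pieces))
```

The chunk boundaries above deliberately fall in the middle of tags.
A filtering proxy would rewrite and forward output from the handler
callbacks instead of collecting tag names.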

Is the situation with XML the same as HTML?  Are XML documents
forced to adhere to the standard or are parsers supposed to try
to do something intelligent with whatever crap they get fed?

>one starting point could be the "gray proxy" (i forget what it was
>really called). that was written on top of medusa, i think there was
>an announcement here? probably a year or so ago.  it parsed the html
>and changed all the colors to grayscale, and did the same for
>images. medusa isn't free though.. (except the version in zope?)

You can try my "munchy" proxy.  It's at:

    http://www.enme.ucalgary.ca/~nascheme/python/

Saying that it parses HTML is a bit of a stretch, however.  It
just uses a couple of regexes.  I'm sure Tim Peters would love
it. :)
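
As a rough illustration of the regex approach, the sketch below drops
<img> tags whose src matches a short block list.  This is not munchy's
actual code.  The patterns, BLOCKED, IMG_RE and strip_ads are all
assumptions made up for the example, and a couple of regexes like
these will miss plenty of real-world markup.

```python
import re

# Hypothetical block list; these patterns are examples only.
BLOCKED = [r"\bads?\.", r"/banners?/", r"doubleclick\.net"]
BLOCKED_RE = re.compile("|".join(BLOCKED), re.IGNORECASE)

# Matches a whole <img ...> tag and captures the value of its src attribute.
IMG_RE = re.compile(
    r"<img\b[^>]*?\bsrc\s*=\s*[\"']?([^\"'\s>]+)[^>]*>",
    re.IGNORECASE,
)


def strip_ads(html):
    """Remove <img> tags whose src matches the block list; keep the rest."""

    def replace(match):
        if BLOCKED_RE.search(match.group(1)):
            return ""            # drop the ad image entirely
        return match.group(0)    # leave other images untouched

    return IMG_RE.sub(replace, html)


if __name__ == "__main__":
    page = ('<p>News</p><img src="http://ads.example.com/b.gif">'
            '<img src="/logo.gif">')
    print(strip_ads(page))
```

Because the tag is removed from the page itself, nothing is left
behind where the ad used to be.  Any <a> element that wrapped the
image is left in place by this sketch.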

In my experience, filtering ads at the HTML level is more
effective than filtering at the request level (like junkbuster).
My list of blocked URLs is very short but still catches ads on
almost all the sites I visit.  Also, when ads are filtered I
usually cannot tell by looking at the page.  Of course, YMMV.


    Neil

-- 
HTML needs a rant tag. --Alan Cox


