[Tutor] Mad Villain Seeks Source Review.

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 21 Aug 2001 00:31:40 -0700 (PDT)


On Tue, 21 Aug 2001, Tesla Coil wrote:

> 
> > Well, why don't you try searching by attribute name
> > instead of color name?  That way you would only have
> > to deal with
> > 'color=\s*("([^" ]+)"|\'([^\' ]+)\'|([^"' ]+))' 
> > instead.
> 
> Hadn't considered that--but I'm even less certain what
> attribute names might be encountered.  Some would have
> 'color' followed by intervening alphabetic characters 
> prior to the possible space (can't afford to be greedy
> there) and '=', eg., bordercolordark, bordercolorlight.
> Then there's 'text' and, who knows what all else?
> 
> OTOH, a Diabolical Plan to Reduce the World Wide Web
> to 256 Grayscale would be much less in order if page
> source were not such an unpredictable mess. <0.5 wink>

Hmmm... so would this diabolical plan only wrap its terrible tendrils
around HTML?  If so, you might be able to avoid the regex issue
altogether by letting a real parser pull the tags apart.  I started with
the HTMLParser class from htmllib, but ended up on SGMLParser from
sgmllib (see the note in the docstring).  Here is the darkness that is
BlackHoleParser.py:



###
"""BlackHoleParser.py: another demonstration of
sgmllib.  Sucks the color right out of a page.

Danny Yoo (dyoo@hkn.eecs.berkeley.edu)

Note: I need to use sgmllib: if I use htmllib, not all the
tags get intercepted by unknown_starttag().

Just feed() a page into this parser, and then getPage() it
to recover the digested remains.
"""


import sgmllib


class BlackHoleParser(sgmllib.SGMLParser):
    def __init__(self):
        # SGMLParser takes no formatter (that is htmllib's HTMLParser),
        # so there is nothing to pass here.
        sgmllib.SGMLParser.__init__(self)
        self.content = []


    def getPage(self):
        """Returns the blackened and charred remains of
        the document as a string."""
        return ''.join(self.content)


    def handle_data(self, data):
        self.content.append(data)


    def handle_comment(self, comment):
        self.content.append('<!-- %s -->' % comment)
        

    def unknown_starttag(self, tag, attrs):
        """We intercept all start tags, and attack
        all color attributes."""
        new_attrs = []
        for name, value in attrs:
            if name in ('color', 'bgcolor'):
                value = "#000000"
            new_attrs.append( (name, value) )
        attr_str = makeAttributeString(new_attrs)
        if attr_str:
            new_starttag = '<%s %s>' % (tag.upper(),
                                        attr_str )
        else:
            new_starttag = '<%s>' % tag.upper()
        self.content.append(new_starttag)


    def unknown_endtag(self, tag):
        self.content.append("</%s>" % tag)

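    def unknown_charref(self, ref):
        # ref arrives without its '&#' and ';', so put them back.
        self.content.append('&#%s;' % ref)

    def unknown_entityref(self, ref):
        # Likewise, restore the '&' and ';' around the entity name.
        self.content.append('&%s;' % ref)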
    


def makeAttributeString(attributes):
    joined_tags = []
    for name, value in attributes:
        joined_tags.append('%s="%s"' %
                           (name.upper(), value))
    return ' '.join(joined_tags)



if __name__ == '__main__':
    import urllib
    import sys
    blackhole = BlackHoleParser()
    spacecraft = urllib.urlopen(sys.argv[1]).read()
    blackhole.feed(spacecraft)
    blackhole.close()    # flush anything still buffered in the parser
    print blackhole.getPage()
###
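
A side note for anyone on Python 3, where sgmllib and htmllib no longer
exist: roughly the same trick can be sketched on top of html.parser.
This is only a sketch under my own assumptions.  The names
BlackHoleParser3, get_page(), _render() and COLOR_ATTRS are made up for
illustration, and COLOR_ATTRS just mirrors the two attributes the script
above attacks.

```python
import html
from html.parser import HTMLParser

# Assumption: the same two attributes as the sgmllib script; extend as needed.
COLOR_ATTRS = {"color", "bgcolor"}


class BlackHoleParser3(HTMLParser):
    """Sketch: re-emit a page with its color attributes blackened."""

    def __init__(self):
        # convert_charrefs=False keeps &name; and &#nnn; out of handle_data()
        # so they can be written back exactly as they came in.
        super().__init__(convert_charrefs=False)
        self.content = []

    def get_page(self):
        return "".join(self.content)

    def _render(self, tag, attrs):
        parts = [tag.upper()]
        for name, value in attrs:
            if value is None:
                parts.append(name.upper())  # bare attribute, e.g. NOWRAP
                continue
            if name in COLOR_ATTRS:
                value = "#000000"
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name.upper(), html.escape(value)))
        return " ".join(parts)

    def handle_starttag(self, tag, attrs):
        self.content.append("<%s>" % self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.content.append("<%s />" % self._render(tag, attrs))

    def handle_endtag(self, tag):
        self.content.append("</%s>" % tag)

    def handle_data(self, data):
        self.content.append(data)

    def handle_comment(self, comment):
        self.content.append("<!--%s-->" % comment)

    def handle_decl(self, decl):
        self.content.append("<!%s>" % decl)

    def handle_entityref(self, name):
        self.content.append("&%s;" % name)

    def handle_charref(self, name):
        self.content.append("&#%s;" % name)
```

You would feed() it a string, close() it, and then call get_page(), just
like the original.  That is only a sketch, though; the sgmllib script
above is the one I actually ran.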



BlackHoleParser.py eats color attributes for lunch.  We can see a
blow-by-blow account of what it does to http://python.org:

###
dyoo@coffeetable:~$ python BlackHoleParser.py http://python.org
<HTML>
<!--  THIS PAGE IS AUTOMATICALLY GENERATED.  DO NOT EDIT.  -->

                          [... oops.  Hmmm... Let's skip some lines.]

<BODY BGCOLOR="#000000" TEXT="#000000" TOPMARGIN="0" LEFTMARGIN="0"
MARGINWIDTH="0" MARGINHEIGHT="0" LINK="#0000bb" VLINK="#551a8b"
ALINK="#ff0000">
                                               [ more lines skipped ]

<TR><TD BGCOLOR="#000000"><B><FONT COLOR="#000000">
Special topics
</font></b></td></tr>
<TR><TD BGCOLOR="#000000">
<A HREF="topics/">Topic Guides</a>
</td></tr>
<TR><TD BGCOLOR="#000000">
<A HREF="2.2/">Python 2.2</a>
</td></tr>
<TR><TD BGCOLOR="#000000">
###


So it appears to work.  It shouldn't be too hard to tone down its
harshness to emit shades of gray.
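
For the shades-of-gray part, here is one way the conversion might look.
It's a rough sketch, not the one true way: to_gray() and HEX_COLOR are
names I'm inventing here, it only understands six-digit hex colors like
"#0000bb", and the luma weights are just one common choice (a plain
average of the three channels would also give a gray).

```python
import re

# Assumption: only "#rrggbb" (or "rrggbb") values are converted.
HEX_COLOR = re.compile(r"^#?([0-9a-fA-F]{6})$")


def to_gray(value):
    """Return a gray '#rrggbb' for a six-digit hex color, else the value unchanged."""
    match = HEX_COLOR.match(value.strip())
    if match is None:
        return value  # named colors such as "red" are left alone in this sketch
    digits = match.group(1)
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    # Assumption: BT.601-style luma weights for perceived brightness.
    level = int(round(0.299 * red + 0.587 * green + 0.114 * blue))
    return "#%02x%02x%02x" % (level, level, level)
```

In unknown_starttag(), the line that sets value = "#000000" would then
become value = to_gray(value).  Named colors would still need a lookup
table of their own.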

Hope this helps!