[Tutor] HTMLParser question

Fri, 19 Apr 2002 22:54:25 -0400

On Friday 19 April 2002 08:23 pm, Daryl Gallatin wrote:
> Hi, is there a good example somewhere on how to use the HTMLParser class
> I would like to create a simple HTML web browser, but the simple example
> provided in the documentation doesn't give me much to work on
>
> I have a text browser, but I want to be able to view at least Simple HTML
> pages

If you're saying you want a GUI-based rendering of HTML that 
pays attention to a reduced tag set (e.g. <p> and some formatting),
that's a non-trivial application.  

As a learning exercise it might be fun, but the practical solution is 
to just use a browser you've already got.

The HTMLParser class is good for stripping information from a web 
page, for example when you just want what's between <pre> </pre> tags.
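
For what it's worth, here is a minimal sketch of that kind of extraction
using the HTMLParser class itself. It assumes Python 3, where the class
lives in html.parser (the program further down is Python 2 and uses
sgmllib instead). The names PreText and pre_text are made up for
illustration.

```python
from html.parser import HTMLParser


class PreText(HTMLParser):
    """Collects the text found inside <pre> ... </pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <pre> element
        self.chunks = []  # pieces of text seen inside <pre>

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag names already lowercased.
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside <pre> is simply dropped.
        if self.depth:
            self.chunks.append(data)


def pre_text(html):
    """Return the concatenated text of every <pre> block in html."""
    parser = PreText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = "<html><body><p>intro</p><pre>OFF\n8 6 0\n</pre><p>end</p></body></html>"
print(pre_text(sample))
```

Everything outside the <pre> element is ignored, so the same class works
on a whole page fetched from the web as well as on a short test string.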

Below is a program you can run from the OS command line, to harvest
polyhedron data in OFF format (a text format used in computational 
geometry) from a specific website.  Parameters might be cube.html 
or icosa.html (these get appended to a base URL that's hardwired into 
the program).  

Following Lundh's advice, I just use the SGMLParser, and find I need 
to capture <pre> and </pre> tags by overriding the unknown_starttag 
and unknown_endtag methods:

#!/usr/bin/python
"""
Thanks to an example in Python Standard Library, Lundh (O'Reilly)

Excerpts data from http://www.scienceu.com/geometry/facts/solids/
which just happens to be saved between <pre> </pre> tags, which
are unique and/or first on each page.
"""

import urllib,sys
import sgmllib

class FoundPre(Exception):
    pass

class ExtractPre(sgmllib.SGMLParser):

    def __init__(self,verbose=0):
        sgmllib.SGMLParser.__init__(self,verbose)
        self.pretag = self.data = None

    def handle_data(self,data):
        if self.data is not None: # skips adding unless <pre> found
            self.data.append(data)

    def unknown_starttag(self, tag, attrs):
        if tag=="pre":
            self.start_pre(attrs)

    def unknown_endtag(self, tag):
        if tag=="pre":
            self.end_pre()  # end_pre takes no arguments

    def start_pre(self,attrs):
        print "Yes!!!"  # found my <pre> tag
        self.data = []

    def end_pre(self):
        self.pretag = self.data
        raise FoundPre # done parsing

def getwebdata(wp):
    p = ExtractPre()
    try:  # clever use of exception to terminate
        while 1:
            s = wp.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundPre:
        return p.pretag
    return None

if __name__ == '__main__':
    webpage = sys.argv[1]
    baseurl = "http://www.scienceu.com/geometry/facts/solids/coords/"
    fp = urllib.urlopen(baseurl + webpage)
    output = open("data.txt","w")
    results = getwebdata(fp)
    fp.close()

    if results:
        for i in results:
            output.write(i)
    output.close()

Example usage:

[kirby@grunch bin]$ scipoly.py cube.html
Yes!!!
[kirby@grunch bin]$ cat data.txt

  OFF
     8    6    0
    -0.469     0.000    -0.664
     0.469     0.000     0.664
    -0.469     0.664     0.000
    -0.469    -0.664     0.000
     0.469     0.664     0.000
    -0.469     0.000     0.664
     0.469     0.000    -0.664
     0.469    -0.664     0.000

  4   3 7 1 5     153  51 204
  4   1 7 6 4     153  51 204
  4   4 1 5 2     153  51 204
  4   5 2 0 3     153  51 204
  4   6 0 3 7     153  51 204
  4   4 6 0 2     153  51 204

I have another Python script to read in the above file and convert
it to Povray for rendering.  Unfortunately, the data is given to only 3 
significant figures, which means some facets aren't coplanar 
enough for Povray's tastes (qhull likewise sometimes finds different
facets when given the same vertices, so I guess I need a better source
of coordinate data).
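
As a rough illustration of the coplanarity issue, the sketch below reads
OFF text laid out like the listing above and measures how far each face's
vertices stray from the plane through its first three. It assumes
Python 3, one record per line, and that the first three vertices of a
face aren't collinear. parse_off and planarity_error are hypothetical
names, not the conversion script mentioned above, and the bent quad is
made-up sample data.

```python
import math


def parse_off(text):
    """Return (vertices, faces) from OFF text with one record per line."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if rows[0] != ["OFF"]:
        raise ValueError("expected an OFF header")
    nverts, nfaces = int(rows[1][0]), int(rows[1][1])  # third count (edges) unused
    vertices = [tuple(float(x) for x in row[:3]) for row in rows[2:2 + nverts]]
    faces = []
    for row in rows[2 + nverts:2 + nverts + nfaces]:
        count = int(row[0])
        # Anything after the vertex indices (the colour values) is ignored.
        faces.append([int(i) for i in row[1:1 + count]])
    return vertices, faces


def planarity_error(vertices, face):
    """Largest distance of a face vertex from the plane of its first three."""
    p0, p1, p2 = (vertices[i] for i in face[:3])
    a = [p1[k] - p0[k] for k in range(3)]
    b = [p2[k] - p0[k] for k in range(3)]
    # Plane normal: cross product of two edge vectors.
    n = [a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0]]
    length = math.sqrt(sum(c * c for c in n))
    worst = 0.0
    for i in face[3:]:
        d = sum(n[k] * (vertices[i][k] - p0[k]) for k in range(3)) / length
        worst = max(worst, abs(d))
    return worst


# Made-up example: a unit square with one corner lifted out of the plane.
bent = "OFF\n4 1 0\n0 0 0\n1 0 0\n1 1 0.1\n0 1 0\n\n4 0 1 2 3 153 51 204\n"
verts, faces = parse_off(bent)
for face in faces:
    print(face, planarity_error(verts, face))
```

Running the same check over data.txt would show which faces, if any, are
off by more than the renderer is willing to tolerate.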

Kirby