[Tutor] HTMLParser question
Kirby Urner
urnerk@qwest.net
Fri, 19 Apr 2002 22:54:25 -0400
On Friday 19 April 2002 08:23 pm, Daryl Gallatin wrote:
> Hi, is there a good example somewhere on how to use the HTMLParser class
> I would like to create a simple HTML web browser, but the simple example
> provided in the documentation doesn't give me much to work on
>
> I have a text browser, but I want to be able to view at least Simple HTML
> pages
If you're saying you want a GUI-based rendering of HTML that
pays attention to a reduced tag set (e.g. <p> and some formatting),
that's a non-trivial application.
As a learning exercise it might be fun, but the practical solution is
to just use a browser you've already got.
The HTMLParser class is good for extracting information from a web
page, for example when you just want what's between <pre> </pre> tags.
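For readers on Python 3, where sgmllib is no longer in the standard library, the same "grab what's inside <pre>" idea might look roughly like the sketch below, built on html.parser.HTMLParser. The names PreExtractor and extract_pre are made up for this illustration, and the sample string is invented; it only handles the first <pre> block and makes no attempt at robust error handling.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text inside the first <pre>...</pre> block (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_pre = False   # True while we are between <pre> and </pre>
        self.done = False     # True once the first </pre> has been seen
        self.chunks = []      # pieces of text found inside the block

    def handle_starttag(self, tag, attrs):
        # Only react to the first <pre>; later ones are ignored.
        if tag == "pre" and not self.done:
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self.in_pre:
            self.in_pre = False
            self.done = True

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_pre(html):
    """Return the text of the first <pre> block, or None if there is none."""
    parser = PreExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) if parser.done else None


# Invented sample page, just to exercise the parser.
sample = "<html><body><p>intro</p><pre>OFF\n8 6 0\n</pre><pre>second</pre></body></html>"
print(extract_pre(sample))
```

Unlike the exception trick used in the script further down, this version simply keeps parsing and ignores everything after the first </pre>, which is simpler but reads the whole page.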
Below is a program you can run from the OS command line, to harvest
polyhedron data in OFF format (a text format used in computational
geometry) from a specific website. Parameters might be cube.html
or icosa.html (these get appended to a base URL that's hardwired into
the program).
Following Lundh's advice, I just use the SGMLParser, and find I need
to capture <pre> and </pre> tags by overriding the unknown_starttag
and unknown_endtag methods:
#!/usr/bin/python
"""
Thanks to example in Python Standard Library, Lundh (O'Reilly)
Excerpts data from http://www.scienceu.com/geometry/facts/solids/
which just happens to be saved between <pre> </pre> tags, which
are unique and/or first on each page.
"""
import urllib,sys
import sgmllib

class FoundPre(Exception):
    pass

class ExtractPre(sgmllib.SGMLParser):

    def __init__(self,verbose=0):
        sgmllib.SGMLParser.__init__(self,verbose)
        self.pretag = self.data = None

    def handle_data(self,data):
        if self.data is not None: # skips adding unless <pre> found
            self.data.append(data)

    def unknown_starttag(self, tag, attrs):
        if tag=="pre":
            self.start_pre(attrs)

    def unknown_endtag(self, tag):
        if tag=="pre":
            self.end_pre()        # end tags carry no attributes

    def start_pre(self,attrs):
        print "Yes!!!"            # found my <pre> tag
        self.data = []

    def end_pre(self):
        self.pretag = self.data
        raise FoundPre            # done parsing

def getwebdata(wp):
    p = ExtractPre()
    try:                          # clever use of exception to terminate
        while 1:
            s = wp.read(512)      # feed the page to the parser in chunks
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundPre:
        return p.pretag
    return None                   # no <pre> ... </pre> pair on the page

if __name__ == '__main__':
    webpage = sys.argv[1]
    baseurl = "http://www.scienceu.com/geometry/facts/solids/coords/"
    fp = urllib.urlopen(baseurl + webpage)
    output = open("data.txt","w")
    results = getwebdata(fp)
    fp.close()
    if results:
        for i in results:
            output.write(i)
    output.close()
Example usage:
[kirby@grunch bin]$ scipoly.py cube.html
Yes!!!
[kirby@grunch bin]$ cat data.txt
OFF
8 6 0
-0.469 0.000 -0.664
0.469 0.000 0.664
-0.469 0.664 0.000
-0.469 -0.664 0.000
0.469 0.664 0.000
-0.469 0.000 0.664
0.469 0.000 -0.664
0.469 -0.664 0.000
4 3 7 1 5 153 51 204
4 1 7 6 4 153 51 204
4 4 1 5 2 153 51 204
4 5 2 0 3 153 51 204
4 6 0 3 7 153 51 204
4 4 6 0 2 153 51 204
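For anyone who wants to pick that file apart, here is a rough Python 3 sketch of reading the layout shown above: an OFF header, a counts line, the vertex rows, then face rows made of a vertex count, that many indices, and trailing color values. The function name parse_off is hypothetical, and the sketch assumes exactly this simple layout (no comments, no other OFF variants).

```python
def parse_off(text):
    """Parse a simple OFF listing into (vertices, faces).

    Assumes the layout shown in the listing above: 'OFF', a counts line,
    one vertex per line, then one face per line where the first number
    is how many vertex indices follow. Trailing color values are ignored.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or rows[0] != ["OFF"]:
        raise ValueError("not an OFF file")
    n_vertices, n_faces = int(rows[1][0]), int(rows[1][1])
    body = rows[2:]
    vertices = [tuple(float(x) for x in row[:3]) for row in body[:n_vertices]]
    faces = []
    for row in body[n_vertices:n_vertices + n_faces]:
        count = int(row[0])
        faces.append([int(i) for i in row[1:1 + count]])
    return vertices, faces


# The cube data from the listing above.
cube = """OFF
8 6 0
-0.469 0.000 -0.664
0.469 0.000 0.664
-0.469 0.664 0.000
-0.469 -0.664 0.000
0.469 0.664 0.000
-0.469 0.000 0.664
0.469 0.000 -0.664
0.469 -0.664 0.000
4 3 7 1 5 153 51 204
4 1 7 6 4 153 51 204
4 4 1 5 2 153 51 204
4 5 2 0 3 153 51 204
4 6 0 3 7 153 51 204
4 4 6 0 2 153 51 204
"""

vertices, faces = parse_off(cube)
print(len(vertices), "vertices,", len(faces), "faces")
```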
I have another Python script to read in the above file and convert
it to Povray for rendering. Unfortunately, the data is given to only 3
significant figures, and this means some facets aren't coplanar
enough for Povray's tastes (qhull likewise sometimes gets different
facets, when parsing the same vertices -- I need a better source
of coordinate data I guess).
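One way to see how far off a facet is would be to measure how far its vertices stray from the plane through the first three of them. The Python 3 sketch below does that with plain tuples; planarity_error and the small vector helpers are names invented here, the two test polygons are made-up data, and what counts as "coplanar enough" for any given renderer is not something this sketch knows.

```python
def sub(a, b):
    """Component-wise a - b for 3-tuples."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product of two 3-tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-tuples."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def planarity_error(points):
    """Largest distance of any point from the plane through the first three."""
    normal = cross(sub(points[1], points[0]), sub(points[2], points[0]))
    length = dot(normal, normal) ** 0.5
    if length == 0:
        raise ValueError("first three points are collinear")
    return max(abs(dot(normal, sub(p, points[0]))) / length for p in points)


# Invented examples: a flat unit square, and the same square with one
# corner nudged out of the plane by 0.001.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
bent = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0.001)]
print(planarity_error(square))
print(planarity_error(bent))
```

Running this over each face of a parsed OFF file would show which facets were pushed out of plane by the rounding.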
Kirby