sholden at bellatlantic.net
Thu Feb 24 13:20:18 EST 2000
> I am writing a web interface. I need to grab a web page, and parse for
> the content between the <pre> and </pre> tags. I figured I would use
> regular expressions...below is the code:
> courtneyb at big-c.com
The best way to handle almost any kind of HTML comprehension is the
htmllib library. I wanted to pull out anchors and handle them (I
eventually determined I was initially replicating webchecker from the
distributed Tools directory, but I later diverged enough to justify
my own code).
Basically, establish a class which inherits from htmllib.HTMLParser,
and then override the tag processing methods you don't like.
Here's my modified parser, where the close() method returns a list of
the HREF properties from the anchor tags:

import htmllib
from urllib import basejoin

class myHTMLParser(htmllib.HTMLParser):
    """Modified to return URL references as a list after parsing."""

    def __init__(self, formatter, URL, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.rootURL = basejoin("http://", URL)
        self.URLstore = [self.rootURL]

    # The name build_hrefs is assumed; this def line is missing from the archived copy.
    def build_hrefs(self):
        """Build the list of unique references from the anchor list."""
        # XXX need to treat http://hostname
        #     and http://hostname/ as equivalent
        if self.base is None:
            base = self.rootURL
        else:
            base = basejoin(self.rootURL, self.base)
        for href in self.anchorlist:
            ref = basejoin(base, href)
            if ref[0:7] == "http://" and ref not in self.URLstore:
                self.URLstore.append(ref)

    # Courtney will need start_pre and end_pre methods here

    def close(self):
        """Terminate parse and return unique URL list."""
        htmllib.HTMLParser.close(self)
        self.build_hrefs()
        return self.URLstore

From your point of view, the remainder of my program isn't really significant,
but you will probably want to import the formatter library. Since you don't
want any HTML rendering, use a creation and call sequence such as:

import sys, urllib, formatter

fmt = formatter.NullFormatter()
if URL == '-':
    f = sys.stdin
else:
    f = urllib.urlopen(URL)
data = f.read()
if f is not sys.stdin:
    f.close()
p = myHTMLParser(fmt, URL)
p.feed(data)
myresult = p.close()
print "I feel sick, Dave"
after writing a close() method which returns whatever your <PRE>-handling
methods have extracted. It may, though, be easier to build your own
formatter which grabs the contents of the <PRE> ... </PRE> pairs as
it's sent out.
This will also handle the multi-line cases. But perhaps it's
over-complicated for your application, in which case ignore everything I just said!
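For the <PRE> case itself, the subclass-and-override idea can be sketched as
follows. This is an illustration under assumptions, not the code from this
post: it is written for a current Python 3, where htmllib is no longer
available, so it swaps in html.parser.HTMLParser and its handle_starttag /
handle_endtag / handle_data hooks in place of start_pre and end_pre. The class
name PreExtractor and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text found between <pre> and </pre> (hypothetical class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # greater than 0 while inside a <pre> element
        self.chunks = []    # text pieces of the <pre> currently being read
        self.blocks = []    # one string per completed <pre> element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <PRE> and <pre> both match.
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        # Keep text only while a <pre> element is open.
        if self.depth:
            self.chunks.append(data)

    def close(self):
        """Finish the parse and return the list of <pre> contents."""
        super().close()
        return self.blocks


p = PreExtractor()
p.feed("<html><body><p>skip</p><pre>line one\nline two</pre></body></html>")
result = p.close()
print(result)
```

Because the text is accumulated until the closing tag arrives, a <PRE> block
that spans several lines comes back as a single string.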
The libraries are our friends (although sometimes slightly
"If computing ever stops being fun, I'll stop doing it"