regular expression

Steve Holden sholden at bellatlantic.net
Thu Feb 24 19:20:18 CET 2000


courtneyb wrote:
> 
> I am writing a web interface.  I need to grab a web page, and parse for
> the content between the <pre> and </pre> tags.  I figured I would use
> regular expressions...below is the code:
> 
 [code snipped]

> courtneyb
> courtneyb at big-c.com
Rhe best way to handle almost any kind of HTML comprehension is the
htmllib library.  I wanted to pull out anchors and handle them (I
eventually determined I was initially replicating webchecker from the
distributed Tools directory, but I later diverged enough to justify
my own code).

Basically, establish a class which inherits from htmllib.HTMLParser,
and then override the tag processing methods you don't like.

Here's my modified parser, where the close() method returns a list of
the HREF properties from the anchor tags:

import htmllib
from urllib import basejoin

class myHTMLParser(htmllib.HTMLParser):
    """Modified to return URL references as a list after parsing."""
    
    def __init__(self, formatter, URL, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.rootURL = basejoin("http://",URL)
        self.URLstore = [self.rootURL]
        
    def build_hrefs(self):
	"""Build the list of unique references from the anchor list."""
	# XXX need to treat http://hostname
	# and http://hostname/ as equivalent
	if self.base is None:
	    base = self.rootURL
	else:
	    base = basejoin(self.rootURL, self.base)
	for href in self.anchorlist:
	    ref = basejoin(base,href)
	    if ref[0:7] == "http://" and ref not in self.URLstore:
		self.URLstore.append(ref)
    #
    # Courtney will need a start_pre and end_pre methods here
    #

    def close(self):
	"""Terminate parse and return unique URL list."""
	self.goahead(1)
	self.build_hrefs()
	return self.URLstore

>From your point of view, the remainder of my program isn't really significant,
but you will probably want to import the formatter library.  Since you don't
want any HTML rendering, use a creation and call sequence such as:

    fmt = formatter.NullFormatter()
	.
	.
	.
    try:
	if URL == '-':
	    f = sys.stdin
	else:
	    f = urllib.urlopen(URL)
	    data = f.read()
	    if f is not sys.stdin:
	    f.close()

	p = myHTMLParser(fmt, URL)
	p.feed(data)
	myresult = p.close()
    except:
	print "I feel sick, Dave"

after writing a close() method which returns whatever your <PRE>-handling
methods have extracted.  It may, though, be easier to build your own
formatter which grabs the contents of the <PRE> ... </PRE> pairs as
it's sent out.

This will also handle the mutli-line cases as well.  But perhaps it's over-
complicated for your application, in which case ignore everything I just said!

The libraries are our friends (although sometimes slightly
old-fashioned ones).

regards
 Steve
--
"If computing ever stops being fun, I'll stop doing it"



More information about the Python-list mailing list