is there a safe/easy way to search and replace text w/in an html page?

Levi Cook levicook at earthlink.net
Sat Nov 27 17:20:43 EST 1999


Hello,

I'm new to python and I've been tasked w/ writing a script that fetches
a web-page and highlights the term that the user searched for. Passing
the URL and search term to my script was a cinch and fetching the page
was also quite easy.

The crux of my problem lies with establishing a strategy for searching
out the users term and replacing it with css tags surrounding the
original term.
At first glance the re module seemed sufficient (and maybe still is ),
but it doesn't really have any knowledge of HTML semantics. My next
venture was to look at the htmllib module. This appears to parse the
html fine, but i'm still at a bit of a loss on what this actually does
for me, particularly how do I get my parser to output my page.

Does anyone know where I can find code that does anything remotely like
this?
Does anyone know where I can find good reference on the htmllib module,
the python library reference is pretty bleak in this area.

Thanks in advance for you time and help.
Levi Cook
levicook at earthlink.net


If anyone's interested here's a bit of the code i'm using for this...

<!--$

startTag = '<font class = highlight><strong>'
endTag = '</strong></font>'
cssTag = '<style type="text/css">.highlight {background-color:
#FFFF99}</style>'

if query.has_key("qt"): qt = string.join(query["qt"],',')
else qt = ""

import urllib
if query.has_key("address"): address = query["address"][0]
else address = urllib.quote("http://papajo.com")
address = urllib.unquote(address)

import urlparse
URL2Fetch = urlparse.urlparse(address)

# fetch the page
import httplib
h = httplib.HTTP(URL2Fetch[1])
h.putrequest('GET', URL2Fetch[2])
h.putheader('Accept', 'text/html')
h.putheader('Accept', 'text/plain')
h.endheaders()
errcode, errmsg, headers = h.getreply()
f = h.getfile()
data = f.read()
f.close()

# insert <base> so that all embeded URL's resolve correctly
termstring = ''(^|\\W)(<head>)($|\\W)'
termpat = re.compile(termstring, re.IGNORECASE)
data = re.sub(termpat, '<head>\n<base href="' + address + '">\n' +
cssTag, data)

-->

<!--$write(data)-->






More information about the Python-list mailing list