[XML-SIG] Extracting info from XHTML with Xpath

Tim Wilson wilson at visi.com
Thu Mar 25 01:20:21 EST 2004


I've got a ton to learn about XML processing, but I was able to piece the
following together using libxml2 and Simon Willison's information at
http://simon.incutio.com/archive/2003/10/21/xpathRocks

#!/usr/bin/python

import libxml2
import urllib2

url = 'http://www.hopkins.k12.mn.us/Pages/district/special/pq/timelytopics.html'

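# Fetch the page and parse it as XML (the page has to be well-formed XHTML).
dom = libxml2.parseDoc(urllib2.urlopen(url).read())
ctxt = dom.xpathNewContext()
# XHTML elements live in a namespace, so bind a prefix for use in XPath.
ctxt.xpathRegisterNs('xhtml', 'http://www.w3.org/1999/xhtml')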

titles = [t.content for t in
          ctxt.xpathEval('//xhtml:h3[@class="coursetitle"]')]
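# split() with no arguments already drops leading, trailing, and repeated
# whitespace, so the extra strip() on each word is not needed.
newtitles = [' '.join(title.split()) for title in titles]
newtitles.sort()
for title in newtitles:
    print title

# The libxml2 bindings do not free these objects automatically.
ctxt.xpathFreeContext()
dom.freeDoc()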

I couldn't find any way to remove extraneous whitespace from the tag
contents without all the splitting, stripping, and joining.
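
For what it's worth, XPath 1.0 does define a normalize-space() function
that collapses runs of whitespace, so that may be worth trying from
libxml2. On the Python side, the split-and-join can at least be tucked
into one small helper. Below is a rough sketch of the same extraction,
not a drop-in replacement: it swaps libxml2 for the standard library's
xml.etree.ElementTree so it runs without the bindings, and it parses a
made-up inline sample instead of fetching the real page. The names
SAMPLE, collapse_whitespace, and course_titles are invented for the
illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical sample standing in for the real page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h3 class="coursetitle">
        Web   Design
        Basics
    </h3>
    <h3 class="other">Not a course</h3>
    <h3 class="coursetitle">  Algebra
       Refresher </h3>
  </body>
</html>"""


def collapse_whitespace(text):
    """Trim the ends and squeeze inner whitespace runs to one space."""
    return ' '.join(text.split())


def course_titles(xhtml):
    """Return the sorted, whitespace-normalized coursetitle headings."""
    root = ET.fromstring(xhtml)
    # ElementTree spells a namespaced tag as {namespace-uri}localname.
    path = ".//{%s}h3[@class='coursetitle']" % XHTML_NS
    titles = [collapse_whitespace(''.join(h3.itertext()))
              for h3 in root.findall(path)]
    return sorted(titles)


for title in course_titles(SAMPLE):
    print(title)
```

ElementTree only understands a small subset of XPath, so anything
fancier than a tag-plus-attribute match would still need libxml2.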

-Tim

-- 
Tim Wilson
Twin Cities, Minnesota, USA
Educational technology guy, Linux and OS X fan, Grad. student, Daddy
mailto: wilson at visi.com   aim: tis270   public key: 0x8C0F8813



