steven.bethard at gmail.com
Wed Jun 13 00:06:38 CEST 2007
Rob Wolfe wrote:
> Steven Bethard <steven.bethard at gmail.com> writes:
>> I'd hate to steer a potential new Python developer to a clumsier
> Try to parse this with your program:
> page2 = '''
> <li><a href="http://domain1/page1">some page1</a></li>
> <li><a href="http://domain2/page2">some page2</a></li>
If you want to parse invalid HTML, I strongly encourage you to look into
BeautifulSoup. Here's the updated code:
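import cStringIO
import ElementSoup # http://effbot.org/zone/element-soup.htm

# page2 is the HTML string quoted above
tree = ElementSoup.parse(cStringIO.StringIO(page2))
for a_node in tree.getiterator('a'):
    url = a_node.get('href')
    if url is not None:
        print url  # assumed; the original line is missing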
>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>> rather have no example than an outdated one.
> This example is by no means "outdated".
Given the simplicity of the ElementSoup code above, I'd still contend
that using HTMLParser here shows too complex an answer to too simple a
problem.
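
For comparison, here is a rough sketch of what the HTMLParser route tends to look like for the same link extraction. It is written for Python 3's standard-library html.parser rather than the 2.4-era module, it is not the code from the wiki page, and the class name LinkCollector is made up for illustration. The input is only the two list items visible in the quoted fragment.

```python
from html.parser import HTMLParser

# Only the two lines of page2 that are visible in the quoted message.
page2 = '''
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
'''


class LinkCollector(HTMLParser):  # hypothetical name, for illustration only
    """Collect the href attribute of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag is lowercased.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.urls.append(href)


collector = LinkCollector()
collector.feed(page2)
collector.close()

for url in collector.urls:
    print(url)
```

The subclass, the callback, and the explicit state on the instance are the extra moving parts being weighed against the five-line loop over tree.getiterator('a').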