fredrik at pythonware.com
Mon Sep 22 19:25:26 CEST 2008
Support Desk wrote:
> the code I am using is
> regex = r'<a href=["|\']([^"|\']+)["|\']>'
that's way too fragile to work with real-life HTML (what if the link has
a TITLE attribute, for example? or contains whitespace after the HREF?)
you might want to consider using a real HTML parser for this task.
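A minimal sketch of the parser-based approach, assuming Python 3 and its standard-library `html.parser` module (the original post predates it and uses Python 2's `urllib`). The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration; they are not from the thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every A start tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case,
        # so <A HREF=...> and <a href=...> are treated alike.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in A elements of html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Illustrative input: one link with an extra TITLE attribute,
# one with whitespace before the closing bracket.
sample = '<p><A HREF="one.html" TITLE="One">1</A> <a href=\'two.html\' >2</a></p>'
links = extract_links(sample)
print(links)
```

Both links are picked up here because the parser tokenizes the whole start tag instead of expecting `>` directly after the closing quote.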
> page_text = urllib.urlopen('http://somesite.com')
> page_text = page_text.read()
> links = re.findall(regex, text, re.IGNORECASE)
the RE looks fine for the subset of all valid A elements that it can handle.
got any examples of pages where you see that behaviour?
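To make that subset concrete, here is a small sketch that runs the quoted pattern against a few hand-written tags. The sample strings are assumptions chosen for illustration, not pages from the thread. Note also that `|` inside a character class is a literal pipe, not alternation, so `["|\']` accepts a pipe character as well as the two quote characters.

```python
import re

# The pattern quoted in the post, unchanged.
regex = r'<a href=["|\']([^"|\']+)["|\']>'

samples = [
    '<a href="http://example.com/a">',  # plain link, double quotes
    "<A HREF='b.html'>",                # upper case, single quotes
    '<a href="c.html" title="C">',      # extra TITLE attribute
    '<a href="d.html" >',               # whitespace before the closing bracket
]

# One list of captured href values per sample; empty means no match.
results = [re.findall(regex, s, re.IGNORECASE) for s in samples]

for s, found in zip(samples, results):
    print(s, found)
```

Only the first two samples match; the last two fail because the pattern requires `>` immediately after the closing quote.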