Regex Help

Mon Sep 22 13:25:26 EDT 2008

Support Desk wrote:

> the code I am using is 
> 
> regex = r'<a href=["|\']([^"|\']+)["|\']>'

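that's way too fragile to work with real-life HTML: it only matches when 
HREF is the first and only attribute and is written without any extra 
whitespace (what if the link has a TITLE attribute, for example?  or 
contains whitespace after the HREF?)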
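
a quick way to see the problem is to run the quoted pattern against a few link variants. this is only a sketch: the sample tags and the example.com URL are made up for illustration and are not taken from the original page.

```python
import re

# The pattern from the quoted message, unchanged.
regex = r'<a href=["|\']([^"|\']+)["|\']>'

# Hypothetical sample tags (not from the original poster's page).
samples = [
    '<a href="http://example.com/">plain</a>',
    '<a href="http://example.com/" title="Example">with title</a>',
    '<a href = "http://example.com/">spaces around =</a>',
    '<a class="x" href="http://example.com/">href not first</a>',
]

# One result list per sample; an empty list means the link was missed.
results = [re.findall(regex, s, re.IGNORECASE) for s in samples]

for sample, found in zip(samples, results):
    print(found, "<-", sample)
```

only the first sample produces a match; the other three come back as empty lists even though they are perfectly ordinary links.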

you might want to consider using a real HTML parser for this task.
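
as a rough sketch of what that could look like, here is a link collector built on the standard library's `html.parser` module. it assumes Python 3 (in Python 2 the module was called `HTMLParser`); the names `LinkCollector` and `extract_links` are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the HREF value of every A start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us,
        # so "A HREF" and "a href" are handled the same way.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all HREF values found in A elements, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical input showing the cases the regex approach misses.
sample = (
    '<A HREF = "a.html" TITLE="t">first</A> '
    "<a class=c href='b.html'>second</a>"
)
links = extract_links(sample)
print(links)
```

extra attributes, attribute order, quoting style and whitespace around the `=` are all left to the parser, so the code only has to say which tag and attribute it cares about.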

> page_text = urllib.urlopen('http://somesite.com')
> page_text = page_text.read()
> 
> links = re.findall(regex, text, re.IGNORECASE)

the RE itself looks fine for the subset of all valid A elements that it 
can handle (a lone, tightly written HREF attribute), though.

got any examples of pages where you see that behaviour?

</F>