Regex Help

Fredrik Lundh fredrik at
Mon Sep 22 19:25:26 CEST 2008

Support Desk wrote:

> the code I am using is 
> regex = r'<a href=["|\']([^"|\']+)["|\']>'

that's way too fragile to work with real-life HTML (what if the link has 
a TITLE attribute, for example?  or contains whitespace after the HREF?)

you might want to consider using a real HTML parser for this task.
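
as a minimal sketch of the parser approach (not necessarily what the original poster ended up using), the standard library's html.parser module in Python 3 can collect HREF values no matter what other attributes or whitespace the tag contains. the names LinkCollector, extract_links and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # the parser passes tag and attribute names in lower case,
        # so <A HREF=...> and <a href=...> are treated alike
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# illustrative markup: an extra TITLE attribute, mixed case,
# single quotes, and whitespace before the closing bracket
sample = (
    '<p><A HREF="one.html" TITLE="first">one</A> '
    "<a title=x href='two.html' >two</a></p>"
)
print(extract_links(sample))
```

both links in the sample are picked up, including the ones the regular expression above would skip.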

> page_text = urllib.urlopen('')
> page_text =
> links = re.findall(regex, text, re.IGNORECASE)

the RE looks fine for the subset of all valid A elements that it can 
handle, though.
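
to make "the subset it can handle" concrete, here is a small check of the quoted pattern against a few hand-written tags. the sample strings are assumptions made up for illustration, not taken from the poster's page:

```python
import re

# the pattern exactly as quoted above
regex = r'<a href=["|\']([^"|\']+)["|\']>'

samples = [
    '<a href="plain.html">',             # matched: exactly the expected shape
    "<A HREF='upper.html'>",             # matched: re.IGNORECASE covers the case
    '<a href="titled.html" title="t">',  # missed: extra attribute after HREF
    '<a href="spaced.html" >',           # missed: whitespace before the ">"
    '<a  href="gap.html">',              # missed: two spaces after "<a"
]

# collect every captured HREF value across all samples
matched = [m for s in samples for m in re.findall(regex, s, re.IGNORECASE)]
print(matched)
```

only the first two samples yield a link. note also that inside a character class the "|" is a literal character rather than an alternation, so ["|\'] accepts a pipe as a delimiter as well as either quote.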

got any examples of pages where you see that behaviour?
