Regex Help
Fredrik Lundh
fredrik at pythonware.com
Mon Sep 22 13:25:26 EDT 2008
Support Desk wrote:
> the code I am using is
>
> regex = r'<a href=["|\']([^"|\']+)["|\']>'
that's way too fragile to work with real-life HTML (what if the link has
a TITLE attribute, for example? or contains whitespace after the HREF?)

you might want to consider using a real HTML parser for this task.
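
for illustration, here is a minimal sketch of the parser approach using only the standard library. it assumes Python 3 (html.parser, rather than the Python 2 modules used in the quoted code); LinkCollector, extract_links and the sample markup are hypothetical names made up for this example, not part of the original code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # the parser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in <a> tags, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# made-up sample: extra attributes, whitespace around "=", mixed case and quotes
sample = (
    '<A HREF = "http://example.com/a" TITLE="first">a</A> '
    "<a title=\"x\" href='/b'>b</a>"
)
print(extract_links(sample))  # ['http://example.com/a', '/b']
```

the parser copes with the extra attributes, the whitespace and the different quote styles without any changes to the matching logic.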
> page_text = urllib.urlopen('http://somesite.com')
> page_text = page_text.read()
>
> links = re.findall(regex, text, re.IGNORECASE)
the RE looks fine for the subset of all valid A elements that it can
handle, though.
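
a small sketch of what that subset looks like in practice, assuming Python 3 and three made-up sample strings (simple, with_title and with_space are not from the original post):

```python
import re

# the pattern from the quoted message. note that "|" inside [...] is a
# literal character, not alternation, so each class matches ", | or '
regex = r'<a href=["|\']([^"|\']+)["|\']>'

# hypothetical inputs for illustration
simple = '<a href="http://example.com/">home</a>'
with_title = '<a href="http://example.com/" title="home">home</a>'
with_space = '<a href = "http://example.com/">home</a>'

print(re.findall(regex, simple, re.IGNORECASE))      # ['http://example.com/']
print(re.findall(regex, with_title, re.IGNORECASE))  # []
print(re.findall(regex, with_space, re.IGNORECASE))  # []
```

only the first form, with href as the sole attribute and no extra whitespace, is picked up.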
got any examples of pages where you see that behaviour?
</F>