[XML-SIG] xml / html parsing for webbot
Bastian Kleineidam
calvin@cs.uni-sb.de
Sun, 10 Dec 2000 15:35:14 +0100 (CET)
Hello Kent,
>2. I have thought of not building a DOM tree but using regular expressions
> to extract all links. Can somebody give me, from their experience, some
> comparison of the two approaches? Which is better? In particular, I found
> some pages generated by scripts that contain unmatched tags.
> How do the two approaches handle them?
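I am using regular expressions. The pattern below is a template with two
slots, one for the tag name and one for the attribute name: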
import re

_linkMatcher = r"""
    (?i)            # case insensitive
    <               # open tag
    \s*             # whitespace
    %s              # tag name
    \s+             # whitespace
    [^>]*?          # skip leading attributes
    %s              # attrib name
    \s*             # whitespace
    =               # equal sign
    \s*             # whitespace
    (?P<value>      # attribute value
     ".*?" |        # in double quotes
     '.*?' |        # in single quotes
     [^\s>]+)       # unquoted
    ([^">]|".*?")*  # skip trailing attributes
    >               # close tag
    """
# and now fill in some tags:
LinkPatterns = (
    re.compile(_linkMatcher % ("a", "href"), re.VERBOSE),
    re.compile(_linkMatcher % ("img", "src"), re.VERBOSE),
    re.compile(_linkMatcher % ("form", "action"), re.VERBOSE),
    re.compile(_linkMatcher % ("body", "background"), re.VERBOSE),
    re.compile(_linkMatcher % ("frame", "src"), re.VERBOSE),
    re.compile(_linkMatcher % ("link", "href"), re.VERBOSE),
    # <meta http-equiv="refresh" content="x; url=...">
    re.compile(_linkMatcher % ("meta", "url"), re.VERBOSE),
    re.compile(_linkMatcher % ("area", "href"), re.VERBOSE),
    re.compile(_linkMatcher % ("script", "src"), re.VERBOSE),
)
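
A minimal sketch of how such patterns might be applied to a page. The
names LINK_TEMPLATE, compile_link_pattern, find_links and the sample page
are made up for illustration and are not taken from Linkchecker; the
template is repeated (with the case flag passed to re.compile instead of
inline) only so that the sketch runs on its own:

```python
import re

# Same template as in the mail, repeated so this sketch is self-contained.
LINK_TEMPLATE = r"""
    <               # open tag
    \s*             # whitespace
    %s              # tag name
    \s+             # whitespace
    [^>]*?          # skip leading attributes
    %s              # attrib name
    \s*=\s*         # equal sign, optionally surrounded by whitespace
    (?P<value>      # attribute value
     ".*?" |        # in double quotes
     '.*?' |        # in single quotes
     [^\s>]+)       # unquoted
    ([^">]|".*?")*  # skip trailing attributes
    >               # close tag
"""


def compile_link_pattern(tag, attr):
    """Fill the two slots of the template (hypothetical helper)."""
    return re.compile(LINK_TEMPLATE % (tag, attr), re.IGNORECASE | re.VERBOSE)


# Only two tag/attribute pairs here, to keep the example short.
PATTERNS = [compile_link_pattern(t, a) for t, a in (("a", "href"), ("img", "src"))]


def find_links(html):
    """Return the raw attribute values (quotes included) of all matches."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(html):
            found.append(match.group("value"))
    return found


# The <p> is never closed properly around the <a>; the regex does not care,
# because it only looks at one tag at a time.
page = '<p>unclosed <A HREF="index.html">home <img src=logo.png></p>'
links = find_links(page)
```

Since each pattern looks at a single tag, unmatched or badly nested tags
elsewhere in the page do not affect the result. The values come back with
their quotes still attached.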
This regex even matches attribute values with a missing quote:
<a href="bla>
<a href=bla">
The captured value then still carries the stray quote, so this only works
if you strip leading and trailing quotes from the URL afterwards.
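
A small sketch of the quote stripping, assuming a condensed one-line
version of the pattern for <a href=...> only. HREF and clean_url are
hypothetical names chosen for this example:

```python
import re

# Condensed, non-verbose form of the a/href pattern (assumption: same
# structure as the template in the mail, just written on one line).
HREF = re.compile(
    r"""<\s*a\s+[^>]*?href\s*=\s*(?P<value>".*?"|'.*?'|[^\s>]+)([^">]|".*?")*>""",
    re.IGNORECASE,
)


def clean_url(raw):
    """Strip stray leading/trailing quote characters from a matched value."""
    return raw.strip("\"'")


samples = ['<a href="bla>', '<a href=bla">', "<a href='bla'>"]
raw_values = [HREF.search(s).group("value") for s in samples]
urls = [clean_url(v) for v in raw_values]
```

For the two broken tags the unquoted alternative matches, so the raw value
is the URL plus one stray quote. One caveat: if another double quote
follows later on the same line, the double-quoted alternative may swallow
everything up to it, so the captured value is worth sanity-checking.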
For a complete code example get Linkchecker:
http://linkchecker.sourceforge.net
and look in linkcheck/UrlData.py
Bastian