high performance hyperlink extraction

Bryan Olson fakeaddress at nowhere.org
Wed Sep 14 00:20:37 EDT 2005


Adam Monsen wrote:
 > The following script is a high-performance link (<a
 > href="...">...</a>) extractor. [...]
 > * extract links from text (most likely valid HTML)
[...]
 > import re
 > import urllib
 >
 > whiteout = re.compile(r'\s+')
 >
 > # grabs hyperlinks from text
 > href_re = re.compile(r'''
 >    <a(?P<attrs>[^>]*     # start of tag
 >    href=(?P<delim>["'])  # delimiter
 >    (?P<link>[^"']*)      # link
 >    (?P=delim)            # delimiter
 >    [^>]*)>               # rest of start tag
 >    (?P<content>.*?)      # link content
 >    </a>                  # end tag
 > ''', re.VERBOSE | re.IGNORECASE)

A few notes:

The single or double quote delimiters are optional in some cases
(and frequently omitted even when required by the current
standard).

Where blank spaces may appear in HTML entities is not so clear.
To follow the standard, one would have to acquire the SGML
standard, which costs money. Popular browsers allow end tags
such as "</a >", which the RE above will reject.
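Both gaps are easy to demonstrate. A small sketch, assuming the RE is reproduced verbatim from the post:

```python
import re

# the RE from the original post, reproduced verbatim
href_re = re.compile(r'''
   <a(?P<attrs>[^>]*     # start of tag
   href=(?P<delim>["'])  # delimiter
   (?P<link>[^"']*)      # link
   (?P=delim)            # delimiter
   [^>]*)>               # rest of start tag
   (?P<content>.*?)      # link content
   </a>                  # end tag
''', re.VERBOSE | re.IGNORECASE)

# a quoted href matches as intended
assert href_re.search('<a href="x.html">text</a>') is not None

# an unquoted href (legal in HTML 4 for simple values, and common
# in the wild) is rejected, because the RE demands a quote delimiter
assert href_re.search('<a href=x.html>text</a>') is None

# an end tag with trailing whitespace, which popular browsers
# accept, is also rejected, because the RE matches only '</a>'
assert href_re.search('<a href="x.html">text</a >') is None
```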


I'm not good at reading REs, but it looks like the first line
will greedily match the entire start tag, and then back-track to
find the href attribute. There appear to be many other good
opportunities for a cleverly-constructed input to force big-time
backtracking; for example, a '>' will end the start tag, but
within the delimiters it's just another character. Can anyone
show a worst-case run-time?
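One sketch of a bad case, assuming the RE above: an input full of '<a ' anchors whose 'href=' is never followed by a quote, and which contains no '>'. At each anchor, [^>]* greedily consumes the rest of the string and then backtracks character by character hunting for a quoted href that never appears, which suggests roughly quadratic growth (the exact behavior depends on the engine):

```python
import re
import time

href_re = re.compile(r'''
   <a(?P<attrs>[^>]*href=(?P<delim>["'])(?P<link>[^"']*)
   (?P=delim)[^>]*)>(?P<content>.*?)</a>
''', re.VERBOSE | re.IGNORECASE)

def worst_case(n):
    # n anchors, no quote after href=, no '>' anywhere:
    # every anchor forces a full backtracking scan of the tail
    text = '<a href= ' * n
    start = time.perf_counter()
    assert href_re.search(text) is None  # never matches
    return time.perf_counter() - start

for n in (500, 1000, 2000):
    print(n, worst_case(n))
# in my analysis the work per anchor is proportional to the
# remaining input, so doubling n should roughly quadruple the time
```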

Writing a Python RE to match all and only legal anchor tags may
not be possible. Writing a regular expression to do so is
definitely not possible.

[...]
 > def getLinks(html_data):
 >    newdata = whiteout.sub(' ', html_data)
 >    matches = href_re.finditer(newdata)
 >    ancs = []
 >    for match in matches:
 >        d = match.groupdict()
 >        a = {}
 >        a['href'] = d.get('link', None)

The statement above doesn't seem necessary. The 'href' just gets
overwritten below, as just another attribute.

 >        a['content'] = d.get('content', None)
 >        attr_matches = attrs_re.finditer(d.get('attrs', None))
 >        for match in attr_matches:
 >            da = match.groupdict()
 >            name = da.get('name', None)
 >            a[name] = da.get('value', None)
 >        ancs.append(a)
 >    return ancs
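For comparison, a real HTML parser sidesteps the quoting, whitespace, and backtracking problems above. A sketch of an equivalent getLinks() built on the standard library's HTMLParser class (html.parser in current Python; the same class lived in the HTMLParser module in 2.x), handling unquoted attributes and sloppy end tags that the RE rejects:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect each anchor's attributes and link text, no regexes."""

    def __init__(self):
        super().__init__()
        self.ancs = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            a = dict(attrs)      # parser has already decoded the attributes
            a['content'] = ''
            self._current = a

    def handle_data(self, data):
        if self._current is not None:
            self._current['content'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self.ancs.append(self._current)
            self._current = None

def getLinks(html_data):
    parser = LinkExtractor()
    parser.feed(html_data)
    parser.close()
    return parser.ancs
```

Note that it copes with inputs the RE cannot, e.g. getLinks('<a href=foo.html class=x>hi</a >') still yields the href.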


-- 
--Bryan


