[Tutor] Regex [negative lookbehind / use HTMLParser to parse HTML]
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Sun Aug 24 22:00:56 EDT 2003
On Sun, 24 Aug 2003, Andrei wrote:
> I'm quite sure I've seen a question of this type before, but I seem
> unable to find it. How can I match a re pattern ONLY if it is not
> preceded by another re pattern?
Hi Andrei,
We can use the negative lookbehind "(?<! )" syntax:
http://www.python.org/doc/lib/re-syntax.html
For example:
###
>>> import re
>>> regex = re.compile(r'(?<!foo)bar')
>>> regex.search('the chocolate bar melts')
<_sre.SRE_Match object at 0x1e8900>
>>> regex.search('foobar')
>>>
###
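One caveat: Python's lookbehind only accepts fixed-width patterns, so it
can't be preceded by an arbitrary regex. Here is a rough sketch applied to
the href case, written for Python 3. The URL pattern and the names
url_not_in_href and sample are my own illustrative assumptions, not a
robust URL matcher:

```python
import re

# Hypothetical, deliberately simple URL pattern: "http://" followed by
# anything that is not whitespace, a double quote, or an angle bracket.
# The negative lookbehind rejects a URL that directly follows href="
url_not_in_href = re.compile(r'(?<!href=")http://[^\s"<>]+')

sample = '<a href="http://python.org">Python</a> and http://python.org/doc on its own'
matches = url_not_in_href.findall(sample)
print(matches)

# A lookbehind must have a fixed width. Allowing optional whitespace
# around the "=" makes the width variable, and re refuses to compile it.
try:
    re.compile(r'(?<!href\s*=\s*")http://')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

So the lookbehind handles the exact prefix href=" but not variations such
as href = " or single-quoted attributes.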
> Think for example of finding all URLs in a piece of text, but *not* if
> they are inside link tags and therefore preceded by 'href="'. <a
> href="http://python.org">Python</a> shouldn't give a match, but
> http://python.org on its own should.
The obligatory "don't parse HTML with regular expressions alone" retort
is instinctively at the tip of my tongue. *grin*
For this particular example, it's a better idea to use regular expressions
in concert with something like HTMLParser:
http://www.python.org/doc/lib/module-HTMLParser.html
For example:
###
>>> regex = re.compile(r"(http://python\.org)")
>>> text = """
... The python.org web site,
... <a href="http://python.org">http://python.org</a>
... is a great resource"""
>>>
>>> regex.findall(text)
['http://python.org', 'http://python.org']
###
Here we see the problem of grabbing http://python.org twice --- we'd like
to avoid looking at tag attributes. To solve this, we can use a parser
that only pays attention to the non-tag data, and run our url-matching
regex on that:
###
>>> import HTMLParser
>>> class Parser(HTMLParser.HTMLParser):
...     def __init__(self):
...         HTMLParser.HTMLParser.__init__(self)
...         self.urls = []
...     def handle_data(self, data):
...         self.urls.extend(regex.findall(data))
...     def getUrls(self):
...         return self.urls
...
>>> p = Parser()
>>> p.feed(text)
>>> p.close()
>>> p.getUrls()
['http://python.org']
###
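For anyone on Python 3, the HTMLParser module was renamed to html.parser.
A minimal sketch of the same idea as a standalone script follows. The
names UrlCollector and URL_RE and the more general URL pattern are my own
assumptions for illustration:

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified URL pattern (an assumption, not a full URL grammar).
URL_RE = re.compile(r'https?://[^\s"<>]+')


class UrlCollector(HTMLParser):
    """Collect URLs that appear in text data, ignoring tag attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_data(self, data):
        # Called only for text between tags, never for attribute values,
        # so the href="..." value is not scanned.
        self.urls.extend(URL_RE.findall(data))


text = """
The python.org web site,
<a href="http://python.org">http://python.org</a>
is a great resource"""

collector = UrlCollector()
collector.feed(text)
collector.close()
urls = collector.urls
print(urls)
```

The parser decides what counts as tag markup, and the regex only ever sees
the plain text in between.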
Hope this helps!