[Tutor] re question

Fri Aug 8 20:10:00 EDT 2003

Hello Jonathan,

Jonathan Hayward http://JonathansCorner.com wrote:
> I'm trying to use regexps to find the contents of all foo tags. So, if I 
> gave the procedure I'm working on an HTML document and asked for 
> "strong" tags, it would return a list of strings enclosed in <strong> 
> </strong> in the original.
> 
> I'm having trouble with the re; at the moment the re seems to return 
> only the first instance. What am I doing wrong?
> 
>    def get_tag_contents_internal(self, tag, file_contents):
>        result = []
>        # At present only matches first occurrence. Regexp should be
>        # worked on.
>        my_re = re.compile(".*?(<" + tag + ".*?>(.*?)</" + tag + \
>          ".*?>.*?)+.*?", re.IGNORECASE)

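To retrieve all matches, use findall(), which returns every 
non-overlapping match instead of only the first one. Your expression 
may also be matching too much: the surrounding .*? parts and the 
repeated outer group are not needed, since findall() scans the whole 
string itself. Here is an example that seems to work: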

 >>> import re
 >>> s = ("text text <strong>this is strong</strong> text "
 ...      "<strong>this is strong too</strong>")
 >>> tag = "strong"
 >>> pattern = re.compile("<%s.*?>(.*?)</%s.*?>" % (tag, tag), re.IGNORECASE)
 >>> pattern.findall(s)
['this is strong', 'this is strong too']
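
The same idea can be folded back into a function like the one in your 
question. This is only a sketch: the name get_tag_contents is made up 
for illustration, and re.escape(), the \b word boundary and the 
re.DOTALL flag are my own additions rather than part of the session 
above. They guard against odd characters in the tag name, against 
"strong" also matching "<strongest>", and against tag contents that 
span several lines.

```python
import re

def get_tag_contents(tag, file_contents):
    """Return the text found between each <tag ...> and </tag> pair."""
    # re.escape() guards against regex metacharacters in the tag name.
    name = re.escape(tag)
    # \b stops "strong" from matching "<strongest>"; [^>]* skips attributes.
    pattern = re.compile(
        r"<%s\b[^>]*>(.*?)</%s\s*>" % (name, name),
        re.IGNORECASE | re.DOTALL,  # any case; contents may contain newlines
    )
    return pattern.findall(file_contents)

html = ("a <STRONG class='x'>one</STRONG> b <strong>two\nlines</strong> "
        "<strongest>no</strongest>")
print(get_tag_contents("strong", html))
```

Note that a regexp like this still cannot cope with a tag nested inside 
another tag of the same name.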

To extract data from HTML files, you may also want to look at the 
'HTMLParser', 'htmllib' and 'sgmllib' modules:

	http://www.python.org/doc/current/lib/markup.html
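
For comparison, here is a rough sketch of the parser-based approach. It 
assumes Python 3, where the 'HTMLParser' module is available as 
html.parser (htmllib and sgmllib are no longer in the standard library 
there). The class name TagCollector and its attributes are hypothetical, 
chosen only for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag (illustrative)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()   # the parser reports tag names in lowercase
        self.depth = 0           # greater than 0 while inside the wanted tag
        self.parts = []          # text pieces of the element being read
        self.results = []        # one finished string per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

collector = TagCollector("strong")
collector.feed("text <strong>this is <em>strong</em></strong> text "
               "<STRONG>too</STRONG>")
collector.close()
print(collector.results)
```

Unlike the regexp, this version keeps only the text and drops any 
markup nested inside the tag, which may or may not be what you want.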

Cheers.

Alexandre