[Tutor] re question
Alexandre Ratti
alex at gabuzomeu.net
Fri Aug 8 20:10:00 EDT 2003
Hello Jonathan,
Jonathan Hayward http://JonathansCorner.com wrote:
> I'm trying to use regexps to find the contents of all foo tags. So, if I
> gave the procedure I'm working on an HTML document and asked for
> "strong" tags, it would return a list of strings enclosed in <strong>
> </strong> in the original.
>
> I'm having trouble with the re; at the moment the re seems to return
> only the first instance. What am I doing wrong?
>
> def get_tag_contents_internal(self, tag, file_contents):
>     result = []
>     # At present only matches first occurrence. Regexp should be worked on.
>     my_re = re.compile(".*?(<" + tag + ".*?>(.*?)</" + tag + \
>                        ".*?>.*?)+.*?", re.IGNORECASE)
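To retrieve all matches, you can use findall(). Also, your expression
may be matching too much: the leading and trailing ".*?" are not needed,
and a group repeated with "+" only keeps its last match. Here is a
simpler example that seems to work: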
>>> import re
>>> s = ("text text <strong>this is strong</strong> text "
...      "<strong>this is strong too</strong>")
>>> tag = "strong"
>>> pattern = re.compile("<%s.*?>(.*?)</%s.*?>" % (tag, tag))
>>> pattern.findall(s)
['this is strong', 'this is strong too']
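The same idea can be wrapped in a function. This is only a minimal sketch: the name get_tag_contents and the choice of flags are my assumptions, not your original code. re.escape() protects against special characters in the tag name, "\b" keeps "strong" from also matching a longer tag name, re.IGNORECASE keeps the case-insensitivity of your attempt, and re.DOTALL lets the contents span line breaks.

```python
import re

def get_tag_contents(tag, text):
    """Return the text enclosed by each <tag ...>...</tag> pair in text."""
    name = re.escape(tag)  # treat the tag name literally inside the pattern
    pattern = re.compile(
        r"<%s\b[^>]*>(.*?)</%s\s*>" % (name, name),
        re.IGNORECASE | re.DOTALL,  # any case; contents may contain newlines
    )
    return pattern.findall(text)

# Hypothetical sample input, for illustration only.
sample = ("a <strong>one</strong> b <STRONG class='x'>two\nlines</STRONG> "
          "<stronger>no</stronger>")
print(get_tag_contents("strong", sample))
```

Note that a regular expression like this does not cope with nested tags of the same name; for that, a real parser is the safer choice.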
To extract data from HTML files, you may also want to look at the
'HTMLParser', 'htmllib' and 'sgmllib' modules:
http://www.python.org/doc/current/lib/markup.html
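As a rough sketch of the parser-based approach: in Python 3 the 'HTMLParser' module lives at html.parser, and 'htmllib' and 'sgmllib' are no longer in the standard library. The class name TagContentCollector and its attributes below are hypothetical, chosen for illustration. It collects the text inside each occurrence of one tag and, unlike the regular expression, drops any markup nested inside it.

```python
from html.parser import HTMLParser

class TagContentCollector(HTMLParser):
    """Collect the text inside each occurrence of one tag (illustrative helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lower case
        self.depth = 0          # how many target tags are currently open
        self.results = []       # one string per outermost target element
        self._buffer = []       # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside the target tag.
        if self.depth > 0:
            self._buffer.append(data)

collector = TagContentCollector("strong")
collector.feed("text <strong>this is <em>strong</em></strong> text <STRONG>too</STRONG>")
collector.close()
print(collector.results)
```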
Cheers.
Alexandre