[Tutor] Regex search in HTML data

Alan Gauld alan.gauld at freenet.co.uk
Tue Aug 8 23:06:47 CEST 2006

> Please see the attachment and examine the code I have provided. The
> problem is, I want to fetch data from <H2>Comments</H2> until the first
> </TD> occurrence.

Do you mean the unmatched /td that occurs after the dd section?

> import re
> import string
> htmlData = """
> <h2>Instructions</h2>....
> <h2>Comments</h2>
> <dl>
>  <dd>None
> </dd></dl>
> </td>

To this one here?
It's probably a bad idea to use a regular tag as a marker;
some browsers get confused by unmatched tags.
Using a comment is usually better.
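
A minimal sketch of the comment-marker idea, assuming you control the
HTML and can add markers to it. The marker text (comments-start and
comments-end) and the variable names are made up for illustration; the
page in the original post contains no such comments.

```python
import re

# Hypothetical markers: these comments are NOT in the original page,
# they are what you would add if you generate the HTML yourself.
page = """
<h2>Comments</h2>
<!-- comments-start -->
<dl>
 <dd>None
</dd></dl>
<!-- comments-end -->
</td>
"""

# re.S lets "." match newlines; .*? stops at the first end marker.
match = re.search(r'<!-- comments-start -->(.*?)<!-- comments-end -->',
                  page, re.S)
section = match.group(1).strip() if match else None
print(section)
```

Because the markers are unique strings, the pattern does not depend on
how the surrounding tags are nested.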

> <td valign="top" width="50%"><h2>Classification</h2>
> <h2><table border="1" cellpadding="1" cellspacing="0" height="60" 
> width="100%">
> <tbody><tr>
> <td width="50%"><b>&nbsp;Utility:</b></td>

But regexes don't like working with nested tags. You have
a table cell inside another, and writing regexes to match
that can get very tricky. So if you want to search into
this part of the string you should probably look at
using Beautiful Soup or a similar HTML parser.
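
As a rough sketch of the parser approach, here is one way to pull out
the Comments text with the standard library's html.parser module rather
than Beautiful Soup (swapped in only so the example needs no extra
install). The class name CommentsExtractor and the trimmed sample string
are illustrative assumptions, not code from the original post.

```python
from html.parser import HTMLParser


class CommentsExtractor(HTMLParser):
    """Collect text between <h2>Comments</h2> and the next </td>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False      # currently inside an <h2> element
        self.capturing = False  # past the Comments heading, before </td>
        self.done = False       # first </td> after the heading was seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case.
        if tag == 'h2':
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        elif tag == 'td' and self.capturing:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.done:
            return
        if self.in_h2:
            if data.strip().lower() == 'comments':
                self.capturing = True
        elif self.capturing:
            self.chunks.append(data)


# Trimmed stand-in for the htmlData string in the post.
sample = """
<h2>Instructions</h2>....
<h2>Comments</h2>
<dl>
 <dd>None
</dd></dl>
</td>
<td valign="top" width="50%"><h2>Classification</h2>
"""

parser = CommentsExtractor()
parser.feed(sample)
parser.close()
comments = "".join(parser.chunks).strip()
print(comments)  # the text of the <dd>, which here is the word None
```

The parser copes with the unmatched </td> and ignores everything after
it, so the nested table further down never comes into play.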

> if __name__ == '__main__':
>    # Extract comments
>    p = re.search('<H2>Comments</H2>(.+)</TD>', htmlData,
>                  re.I | re.S | re.M)

Looks like you are getting caught out by the "greedy" nature
of regexes: they grab as much as they can. With re.S set, (.+)
runs on to the last </TD> in the string rather than stopping
at the first.

You can control that by adding a ? immediately after the +
(making it .+?), which stops the match at the first </TD>,
but given the nature of your html I'd try using Beautiful Soup instead.
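
A minimal sketch of the greedy versus non-greedy difference, assuming a
cut-down copy of the htmlData string (the name html_data and the trimmed
markup are illustrative, not the full page from the post):

```python
import re

# Trimmed-down stand-in for the htmlData string in the post.
html_data = """
<h2>Comments</h2>
<dl>
 <dd>None
</dd></dl>
</td>
<td valign="top" width="50%"><h2>Classification</h2>
<table><tbody><tr>
<td width="50%"><b>&nbsp;Utility:</b></td>
</tr></tbody></table>
</td>
"""

flags = re.I | re.S  # ignore case; let "." match newlines too

# Greedy: (.+) runs on to the LAST </td> in the string.
greedy = re.search(r'<H2>Comments</H2>(.+)</TD>', html_data, flags)

# Non-greedy: (.+?) stops at the FIRST </td> after the heading.
lazy = re.search(r'<H2>Comments</H2>(.+?)</TD>', html_data, flags)

print('Classification' in greedy.group(1))  # True: grabbed too much
print(repr(lazy.group(1).strip()))
```

The only change between the two patterns is the ? after the +.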

You'll find a short section on greedy expressions in my regex
topic on my tutorial site.


Alan Gauld
Author of the Learn to Program web site
