[Tutor] re question

Fri Aug 8 15:40:14 EDT 2003

Jonathan Hayward http://JonathansCorner.com wrote:

> Jeff Shannon wrote:
>
>> tpc at csua.berkeley.edu wrote:
>>
>>> hello Jonathan, you should use re.findall as re.match only returns the
>>> first instance.  By the way I would recommend the htmllib.HTMLParser
>>> module instead of reinventing the wheel.
>>>
>>
>> Indeed, it's not just reinventing the wheel.  Regular expressions, by 
>> themselves, are insufficient to do proper HTML parsing, because re's 
>> don't remember state and can't deal with nested/branched data 
>> structures (which HTML/XML/SGML are).  As someone else pointed out, 
>> you're likely to grab too much, or not enough.  Anybody seriously 
>> trying to do anything with HTML should be using HTMLParser, *not* re.
>>
> Hmm...
>
> I looked through the library docs on this, and tried to do it with 
> re's because figuring out how to use HTMLParser looked like more work 
> than using re's -- 3 hours' documentation search to avoid one hour of 
> reinventing the wheel. 

Depending on just how limited your needs are, you *might* be able to get 
away with just using re's -- but you might also get bitten.

> What I'd like to do is:
>
> Find all instances of such-and-such between two tags (for which I've 
> received a helpful response). 

This is trickier than it sounds.  Look at the string

"This is <b>bad</b>. Very <b>bad</b>."

Within this string, there are *three* instances of something that's 
between paired tags -- "bad", twice, and this:

"bad</b>. Very <b>bad"

A regular expression will have difficulty distinguishing between these. 
Of course, you'll immediately think, "just match the closest one!"  That 
doesn't work either, though.  Consider a nested list --

<ol>
<li>
    <ol><li>One</li><li>Two</li></ol>
</li>
<li>One, again</li>
</ol>

Here, if you're set to match the first closing tag you see, then the 
<ol> from the outer list will be matched with the </ol> from the inner 
list. Not good.
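
To make both failure modes concrete, here is a small sketch using Python's re module.  It assumes the sample sentence was marked up with <b> tags (a stand-in; any paired tag behaves the same way), and the variable names are just for illustration.

```python
import re

# Assumed markup for the sample sentence; <b> stands in for any paired tag.
text = "This is <b>bad</b>. Very <b>bad</b>."

# Greedy: runs from the first <b> to the *last* </b>, grabbing too much.
greedy = re.findall(r"<b>(.*)</b>", text)
# Non-greedy ("match the closest one"): fine for this flat string.
lazy = re.findall(r"<b>(.*?)</b>", text)

# The nested list from above, written on one line.
nested = (
    "<ol><li><ol><li>One</li><li>Two</li></ol></li>"
    "<li>One, again</li></ol>"
)
# Non-greedy again: the outer <ol> gets paired with the inner </ol>.
lazy_nested = re.search(r"<ol>(.*?)</ol>", nested).group(1)

print(greedy)       # ['bad</b>. Very <b>bad']
print(lazy)         # ['bad', 'bad']
print(lazy_nested)  # <li><ol><li>One</li><li>Two</li>
```

The non-greedy pattern gets the flat string right and the nested list wrong; the greedy one has the opposite problem on input like the first string.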

In fact, there is no way to construct a normal RE that can work right 
for both of these cases.  In order to do that, you need a proper parser, 
which keeps track of where it is in the document (for example, which 
tags are currently open).  HTMLParser can handle both of these cases 
with no problems.
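
As a rough sketch of what that looks like, assuming a current Python 3 where the module lives at html.parser (in the Python 2 of this thread it was the HTMLParser module): the class below is hypothetical, not anything from the standard library, and it only counts open <ol> tags so it knows when the outermost list has really ended.

```python
from html.parser import HTMLParser


class OuterListText(HTMLParser):
    """Collect the text inside each top-level <ol>, nested lists included."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <ol> elements are currently open
        self.chunks = []  # text pieces of the list being read
        self.lists = []   # one entry per finished top-level list

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "ol" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:  # the outermost list just closed
                self.lists.append(self.chunks)
                self.chunks = []

    def handle_data(self, data):
        # Keep non-blank text that appears anywhere inside a list.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


sample = """<ol>
<li>
    <ol><li>One</li><li>Two</li></ol>
</li>
<li>One, again</li>
</ol>"""

parser = OuterListText()
parser.feed(sample)
parser.close()
print(parser.lists)  # [['One', 'Two', 'One, again']]
```

The depth counter is the piece of state a single RE has no way to keep: the inner </ol> only brings it back to 1, so the outer list stays open until its own closing tag arrives.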

> Strip out all (or possibly all-but-whitelist) tags from an HTML page 
> (substitute "" for "<.*?>" over multiple lines?). 

This could be done with REs, provided that you're not worried about 
preserving the meaning of the tag.  You're throwing away the 
tree-structure of your data, so the limitations of REs don't apply.
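
A minimal sketch of both variants, assuming plain tags with no ">" inside attribute values and no comments or scripts to worry about.  The helper name strip_tags and its keep parameter are made up for this example.

```python
import re

# The pattern from the question; DOTALL lets "." cross line breaks,
# so a tag that is split over several lines is still matched.
TAG = re.compile(r"<.*?>", re.DOTALL)

# Captures the tag name so a whitelist can be checked.
NAMED_TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_tags(html, keep=()):
    """Remove every tag whose name is not in `keep` (hypothetical helper)."""
    def replace(match):
        name = match.group(1).lower()
        return match.group(0) if name in keep else ""
    return NAMED_TAG.sub(replace, html)


page = '<p>Hello,\n<a\n href="x.html">world</a>!</p>'

all_stripped = TAG.sub("", page)
links_kept = strip_tags(page, keep={"a"})

print(repr(all_stripped))  # 'Hello,\nworld!'
print(repr(links_kept))    # 'Hello,\n<a\n href="x.html">world</a>!'
```

Note that the whitelist pattern only matches things that look like named tags, so comments and doctype declarations pass through it untouched.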

> Iterate over links / images and selectively change the targets / 
> sources (which would take me a lot of troubleshooting to do with RE). 

This could possibly work, too -- an <img src="..."> tag is complete in 
and of itself, and therefore can be safely recognized by a RE. You 
could work out a RE that would identify the contents of the src 
attribute, and substitute it with some transformed version of itself. 
 Similarly, you could identify and modify the href attribute of an <a> 
tag -- but you could not safely identify the text between <a> and </a>, 
because REs can't properly capture the necessary level of structure.
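
Something along these lines might do for the src case.  It is only a sketch: it assumes double-quoted attribute values, and the names IMG_SRC and rewrite_src are invented for the example.  The same idea should carry over to href on <a> tags.

```python
import re

# Group 1: everything from "<img" up to the opening quote of src.
# Group 2: the URL itself.  Group 3: the closing quote.
# Assumes the value is double-quoted and the tag contains no stray ">".
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)


def rewrite_src(html, transform):
    """Replace each <img> src with transform(old_src) (hypothetical helper)."""
    def replace(match):
        return match.group(1) + transform(match.group(2)) + match.group(3)
    return IMG_SRC.sub(replace, html)


page = '<p><img alt="logo" src="logo.png"> and <img src="photos/cat.jpg"></p>'
result = rewrite_src(page, lambda url: "/static/" + url)
print(result)
# <p><img alt="logo" src="/static/logo.png"> and <img src="/static/photos/cat.jpg"></p>
```

Passing a function to sub() is what makes the "selectively change" part easy: the transform can look at each URL and return it unchanged when it should be left alone.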

> Possibly other things; I'm not trying to compete with HTMLParser but 
> get some basic functionality. I'd welcome suggestions on how to do 
> this with HTMLParser. 

Depending on how basic your functionality is (i.e., if you know for 
absolute certain that you will *never* have nested tags), you may be 
able to get away with REs -- after all, if you're firing a gun across 
the room, the path of the bullet might as well be straight.  But if 
you're firing artillery, then you can't make that simplification.  And 
even when you're firing across the room... if the shot ends up going out 
the window, then suddenly you're in a whole different paradigm and your 
simplification won't work, and may cause a whole heap of unexpected 
problems.

(I've never needed to parse HTML myself, so I haven't looked into how to 
use HTMLParser, and I can't offer specific code to help with your 
problems.  I've just picked up enough about parsing in general to know 
that REs alone are not up to the task.)

Jeff Shannon
Technician/Programmer
Credit International