[Tutor] re question
Jeff Shannon
jeff at ccvcorp.com
Fri Aug 8 15:40:14 EDT 2003
Jonathan Hayward http://JonathansCorner.com wrote:
> Jeff Shannon wrote:
>
>> tpc at csua.berkeley.edu wrote:
>>
>>> hello Jonathan, you should use re.findall as re.match only returns the
>>> first instance. By the way I would recommend the htmllib.HTMLParser
>>> module instead of reinventing the wheel.
>>>
>>
>> Indeed, it's not just reinventing the wheel. Regular expressions, by
>> themselves, are insufficient to do proper HTML parsing, because re's
>> don't remember state and can't deal with nested/branched data
>> structures (which HTML/XML/SGML are). As someone else pointed out,
>> you're likely to grab too much, or not enough. Anybody seriously
>> trying to do anything with HTML should be using HTMLParser, *not* re.
>>
> Hmm...
>
> I looked through the library docs on this, and tried to do it with
> re's because figuring out how to use HTMLParser looked like more work
> than using re's -- 3 hours' documentation search to avoid one hour of
> reinventing the wheel.
Depending on just how limited your needs are, you *might* be able to get
away with just using re's -- but you might also get bitten.
> What I'd like to do is:
>
> Find all instances of such-and-such between two tags (for which I've
> received a helpful response).
This is trickier than it sounds. Look at the string
"This is <strong>bad</strong>. Very <strong>bad</strong>."
Within this string, there are *three* instances of something that's
between paired tags -- "bad", twice, and this:
"bad</strong>. Very <strong>bad"
A regular expression will have difficulty distinguishing between these.
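
To make the ambiguity concrete, here is a minimal sketch using Python's re module on the string above. The variable names are only illustrative. The greedy pattern returns the long middle span, while the non-greedy one returns the two short matches:

```python
import re

text = "This is <strong>bad</strong>. Very <strong>bad</strong>."

# Greedy: .* runs on to the LAST </strong>, swallowing everything between.
greedy = re.findall(r"<strong>(.*)</strong>", text)

# Non-greedy: .*? stops at the NEAREST </strong>.
lazy = re.findall(r"<strong>(.*?)</strong>", text)

print(greedy)  # ['bad</strong>. Very <strong>bad']
print(lazy)    # ['bad', 'bad']
```

The non-greedy form looks like the fix here, but it is only right as long as the tags are never nested.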
Of course, you'll immediately think, just match the closest one! That
doesn't work either, though. Consider a nested list --
<ol>
<li>
<ol><li>One</li><li>Two</li></ol>
</li>
<li>One, again</li>
</ol>
Here, if you're set to match the first closing tag you see, then the
<ol> from the outer list will be matched with the </ol> from the inner
list. Not good.
In fact, there is no way to construct a normal RE that can work right
for both of these cases. In order to do that, you need a proper parser,
one that keeps track of how deeply the tags are nested (in effect, a
stack). HTMLParser can handle both of these cases with no problems.
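
As a rough sketch of the difference, assuming a current Python where the parser lives in html.parser (in 2003 it was the HTMLParser module), the following runs a non-greedy RE and a small HTMLParser subclass over the nested list above. The ListDepth class and its attribute names are made up for illustration. Note that the parser itself only reports tags and text as it sees them; the depth counter is the bit of state that the RE has no way to keep:

```python
import re
from html.parser import HTMLParser

html = """<ol>
<li>
<ol><li>One</li><li>Two</li></ol>
</li>
<li>One, again</li>
</ol>"""

# Non-greedy RE: the outer <ol> gets paired with the inner </ol>.
regex_result = re.search(r"<ol>(.*?)</ol>", html, re.DOTALL).group(1)


class ListDepth(HTMLParser):
    """Hypothetical helper: records (nesting depth, text) for each text run."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "ol":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.items.append((self.depth, text))


parser = ListDepth()
parser.feed(html)
parser.close()

print(repr(regex_result))  # '\n<li>\n<ol><li>One</li><li>Two</li>'
print(parser.items)        # [(2, 'One'), (2, 'Two'), (1, 'One, again')]
```

The RE result stops in the middle of the outer list, while the parser-based version knows which list each item belongs to.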
> Strip out all (or possibly all-but-whitelist) tags from an HTML page
> (substitute "" for "<.*?>" over multiple lines?).
This could be done with RE's, provided that you're not worried about
preserving the meaning of the tag. You're throwing away the
tree-structure of your data, so the limitations of REs don't apply.
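
A minimal sketch of the tag stripping, with an optional whitelist. The function name strip_tags and its keep parameter are my own invention, not anything from the library. It uses [^>]* rather than .*? so that tags spanning several lines match without needing re.DOTALL. It assumes no ">" appears inside an attribute value, and it leaves the contents of comments, scripts and stylesheets behind as text:

```python
import re


def strip_tags(html, keep=()):
    """Remove every tag, except those whose names are listed in keep."""
    if keep:
        names = "|".join(re.escape(name) for name in keep)
        # The lookahead refuses both <name ...> and </name> for kept tags.
        pattern = r"<(?!/?(?:%s)\b)[^>]*>" % names
    else:
        pattern = r"<[^>]*>"
    return re.sub(pattern, "", html, flags=re.IGNORECASE)


page = """<p>Hello,
<a
  href="x.html">world</a>!</p>"""

print(repr(strip_tags(page)))
# 'Hello,\nworld!'
print(strip_tags("<p><b>bold</b><br>text</p>", keep=("b",)))
# <b>bold</b>text
```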
> Iterate over links / images and selectively change the targets /
> sources (which would take me a lot of troubleshooting to do with RE).
This could possibly work, too -- an <img src="..."> tag is complete in
and of itself, and therefore can be safely recognized by a RE. You
could work out a RE that would identify the contents of the src
attribute, and substitute it with some transformed version of itself.
Similarly, you could identify and modify the href attribute of an <a>
tag -- but you could not safely identify the text between <a> and </a>,
because REs can't properly capture the necessary level of structure.
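
Here is one way the src rewriting might look, as a sketch only. The names IMG_SRC and rewrite_src, and the "/static/" prefix, are hypothetical. It assumes the src value is always in double quotes and that no ">" occurs inside the tag's attribute values; real-world HTML breaks both assumptions often enough that you would want to test against your own pages:

```python
import re

# Group 1: the tag up to and including src=", group 2: the URL, group 3: the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def rewrite_src(html, transform):
    """Replace each <img> src with transform(old_src), leaving the rest alone."""
    return IMG_SRC.sub(
        lambda m: m.group(1) + transform(m.group(2)) + m.group(3), html
    )


page = '<p><img alt="logo" src="logo.png"> and <img src="pics/cat.jpg" width="10"></p>'
result = rewrite_src(page, lambda url: "/static/" + url)
print(result)
# <p><img alt="logo" src="/static/logo.png"> and <img src="/static/pics/cat.jpg" width="10"></p>
```

The same pattern with a and href in place of img and src would cover link targets, under the same assumptions.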
> Possibly other things; I'm not trying to compete with HTMLParser but
> get some basic functionality. I'd welcome suggestions on how to do
> this with HTMLParser.
Depending on how basic your functionality is (i.e., if you know for
absolute certain that you will *never* have nested tags), you may be
able to get away with REs -- after all, if you're firing a gun across
the room, the path of the bullet might as well be straight. But if
you're firing artillery, then you can't make that simplification. And
even when you're firing across the room... if the shot ends up going out
the window, then suddenly you're in a whole different paradigm and your
simplification won't work, and may cause a whole heap of unexpected
problems.
(I've never needed to parse HTML myself, so I haven't looked into how to
use HTMLParser, and I can't offer specific code to help with your
problems. I've just picked up enough about parsing in general to know
that REs alone are not up to the task.)
Jeff Shannon
Technician/Programmer
Credit International