[Tutor] re question

Jeff Shannon jeff at ccvcorp.com
Fri Aug 8 12:28:18 EDT 2003


tpc at csua.berkeley.edu wrote:

>Hello Jonathan, you should use re.findall, as re.match returns only a
>single match (and only at the start of the string).  By the way, I would
>recommend the htmllib.HTMLParser module instead of reinventing the wheel.
>

Indeed, it's not just reinventing the wheel.  Regular expressions, by 
themselves, are insufficient for proper HTML parsing, because they keep 
no state and so can't handle nested or branched data structures (which 
HTML, XML, and SGML all are).  As someone else pointed out, you're likely 
to grab too much or not enough.  Anybody seriously trying to do anything 
with HTML should be using HTMLParser, *not* re.
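
To make the "too much, or not enough" point concrete, here is a minimal 
sketch.  It assumes a current Python 3, where the parser lives in 
html.parser rather than htmllib; the sample strings, the DivText class, 
and the div_texts helper are all made up for illustration and are not 
from the original question.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample inputs, chosen only to show the failure modes.
NESTED = "<div>outer <div>inner</div> tail</div>"
SIBLINGS = "<div>a</div><div>b</div>"

# A non-greedy pattern stops at the first closing tag, so on nested
# input it grabs too little (and leaves a stray opening tag inside).
lazy = re.findall(r"<div>(.*?)</div>", NESTED)

# A greedy pattern runs to the last closing tag, so on sibling
# elements it grabs too much (both divs fused into one match).
greedy = re.findall(r"<div>(.*)</div>", SIBLINGS)


class DivText(HTMLParser):
    """Collect the text of each <div>, tracking nesting with a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one text buffer per currently open <div>
        self.found = []  # finished div texts, in the order they close

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([])

    def handle_data(self, data):
        # Text belongs to every <div> that is still open around it.
        for buf in self.stack:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.found.append("".join(self.stack.pop()))


def div_texts(html):
    """Return the text content of every <div> in html."""
    parser = DivText()
    parser.feed(html)
    parser.close()
    return parser.found
```

The stack is the "state" that a lone regular expression lacks: each 
opening tag pushes a buffer and each closing tag pops one, so the inner 
and outer divs are kept apart no matter how deeply they nest.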

Jeff Shannon
Technician/Programmer
Credit International




