[Tutor] re question
Jeff Shannon
jeff at ccvcorp.com
Fri Aug 8 12:28:18 EDT 2003
tpc at csua.berkeley.edu wrote:
>hello Jonathan, you should use re.findall, as re.match only matches at
>the start of the string and re.search returns only the first instance.
>By the way, I would recommend the htmllib.HTMLParser module instead of
>reinventing the wheel.
>
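For reference, here is a quick sketch of how the three re calls differ. The sample string and pattern are made up for illustration and are not from the original question:

```python
import re

# Hypothetical sample input, not taken from the thread.
html = '<a href="one.html">1</a> <a href="two.html">2</a>'
pattern = r'href="([^"]*)"'

# re.match only tries to match at the very start of the string,
# so it finds nothing here (the string starts with "<a ").
at_start = re.match(pattern, html)

# re.search scans forward but stops at the first hit.
first = re.search(pattern, html)

# re.findall returns every non-overlapping match; with one group in
# the pattern, it returns the captured text of that group.
all_hrefs = re.findall(pattern, html)
```

Here at_start is None, first.group(1) is the first href only, and all_hrefs holds both.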
Indeed, it's not just reinventing the wheel. Regular expressions, by
themselves, are insufficient to do proper HTML parsing, because they
don't remember state and can't deal with arbitrarily nested or branched
data structures (which HTML/XML/SGML documents are). As someone else
pointed out, you're likely to grab too much, or not enough. Anybody
seriously trying to do anything with HTML should be using HTMLParser,
*not* re.
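A minimal sketch of the "too much, or not enough" problem, next to a parser that keeps track of nesting. The sample strings and the DepthTracker class are hypothetical, and I'm assuming the html.parser module (where HTMLParser lives in current Python versions); treat it as an illustration rather than a recommended design:

```python
import re
from html.parser import HTMLParser

# Hypothetical samples: one nested <b>, and two sibling <b> elements.
nested = '<b>outer <b>inner</b> tail</b>'
siblings = '<b>one</b> and <b>two</b>'

# Non-greedy: stops at the first closing tag, so it grabs too little.
too_little = re.findall(r'<b>(.*?)</b>', nested)

# Greedy: runs from the first opening tag to the last closing tag,
# so it grabs too much.
too_much = re.findall(r'<b>(.*)</b>', siblings)


class DepthTracker(HTMLParser):
    """Records each text chunk with the <b> nesting depth it appears at."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # the state a plain regular expression cannot keep
        self.chunks = []  # list of (depth, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'b':
            self.depth -= 1

    def handle_data(self, data):
        self.chunks.append((self.depth, data))


tracker = DepthTracker()
tracker.feed(nested)
tracker.close()
```

After this runs, too_little contains 'outer <b>inner', too_much contains 'one</b> and <b>two', and tracker.chunks pairs each piece of text with the depth it was found at.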
Jeff Shannon
Technician/Programmer
Credit International