HTML Parser

David M. Cooke cookedm at physics.mcmaster.ca
Sat Dec 30 22:13:52 EST 2000


At some point, kragen at dnaco.net (Kragen Sitaker) wrote:

> In article <mailman.978224102.28797.python-list at python.org>,
> Voitenko, Denis <dvoitenko at qode.com> wrote:
> >HTMLtags=re.compile('<.*>')
> 
> In a string like "x<a>b<c>d", this will match "<a>b<c>", because the .*
> matches "a>b<c".  This explains your problem.
> 
> Fixing it is harder.

Not that hard: use the pattern '<.*?>'. This exact pattern is used as
an example in the documentation for the regular expression syntax for
the re module. Here's the relevant paragraph:

*?, +?, ??
  The "*", "+", and "?" qualifiers are all greedy; they match as much
  text as possible. Sometimes this behaviour isn't desired; if the RE
  <.*> is matched against '<H1>title</H1>', it will match the entire
  string, and not just '<H1>'. Adding "?" after the qualifier makes it
  perform the match in non-greedy or minimal fashion; as few characters
  as possible will be matched. Using .*? in the previous expression will
  match only '<H1>'.
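
To see the difference concretely, here is a minimal sketch that runs both
patterns against the two strings mentioned in this thread. The variable
names are only for illustration and are not from the original code. Note
that '.' does not match a newline by default, so a tag split across lines
would be missed, and a '>' inside an attribute value would end the match
early; treat this as a demonstration of greedy versus non-greedy matching
rather than a robust HTML parser.

```python
import re

# Greedy: .* grabs as much as it can, then backs off to the last '>'.
greedy = re.compile(r'<.*>')
# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.compile(r'<.*?>')

heading = '<H1>title</H1>'   # string from the documentation paragraph
mixed = 'x<a>b<c>d'          # string from the quoted message

greedy_heading = greedy.search(heading).group()  # whole string
lazy_heading = lazy.search(heading).group()      # just the opening tag

greedy_mixed = greedy.search(mixed).group()      # spans both tags
lazy_tags = lazy.findall(mixed)                  # each tag separately

# Removing every tag with the non-greedy pattern leaves only the text.
stripped = lazy.sub('', mixed)

print(greedy_heading, lazy_heading)
print(greedy_mixed, lazy_tags, stripped)
```

With the non-greedy pattern, findall() returns each tag on its own and
sub() can be used to strip them, which appears to be what the original
HTMLtags pattern was meant to do.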

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm at mcmaster.ca


