Parsing HTML/SGML/XML (was: Re: a regular expression question)

Sat Mar 22 16:50:53 EST 2003

On Sat, 2003-03-22 at 08:05, Roy Smith wrote:
> Second, as somewhat of a meta comment, I think it's a testament to the 
> complexity of {HT,SG,X}ML parsers everywhere that people are always 
> looking for ways to avoid using them.  Regex is no picnic, yet people 
> seem to prefer trying to use it to doing it "the right way".

I think people are drawn to regexes because the search-by-pattern style
of programming is much easier to work with than a parser.

What would be interesting is a regex-like interface, which worked on
input that was somewhat more structured than a plain string.

The character-level interface is still useful in conjunction with that,
though: in this example it is known that the link text is a number.  It
would be nice if you could say something like, oh,

  r'(:<a href="(.*)")([0-9]+)(:</a)(.*)'

Where (:< signals the start of a tag-search.  Such a group would match
only a tag, and the text inside the tag would be parsed.  So that
href="..." would actually match href=link, HREF = link, HREF="link" or
even junk like href='link'.  If you included multiple attributes, those
attributes would not be required to be in order.
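
For comparison, here is a rough sketch of how the same search looks today
with the parser in Python's standard library (html.parser in Python 3).
It is not the proposed (:< syntax, only an illustration of what that
syntax would have to do behind the scenes.  The names NumberedLinkFinder
and find_numbered_links are made up for this example.

```python
import re
from html.parser import HTMLParser


class NumberedLinkFinder(HTMLParser):
    """Collect (href, text) for each <a href=...> whose text is all digits."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.  Names arrive
            # lowercased and values unquoted, so quoting style, case,
            # and attribute order no longer matter here.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # The character-level part: the link text must be a number.
            if re.fullmatch(r"[0-9]+", text):
                self.links.append((self._href, text))
            self._href = None


def find_numbered_links(html):
    finder = NumberedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links
```

Calling find_numbered_links() on a page returns a list of (href, number)
pairs.  It works, but it takes a class and three callbacks to say what
the one-line pattern above says, which is rather the point.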

You might also want an anything-but-a-tag character class (i.e., [^<]),
or other details, like an all-other-attributes identifier to be used in
(:<, so you could match both <a href=blah class=menu> and <a href=blah>
without lots of effort.
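
To give an idea of what such a pattern would have to expand to, here is
a hand-written approximation using plain re.  It is only a sketch: the
names VALUE, OTHER_ATTRS, NUMBERED_LINK and numbered_links are invented
for this example, and it makes no attempt to handle comments, scripts,
or malformed markup.

```python
import re

# One attribute value: double-quoted, single-quoted, or bare.
VALUE = r"""(?:"[^"]*"|'[^']*'|[^\s>"']+)"""

# The "all other attributes" piece: any attributes we do not care about.
OTHER_ATTRS = rf"""(?:\s+[^\s=>]+(?:\s*=\s*{VALUE})?)*?"""

NUMBERED_LINK = re.compile(
    rf"""<a{OTHER_ATTRS}\s+href\s*=\s*(?P<href>{VALUE}){OTHER_ATTRS}\s*>
         \s*(?P<number>[0-9]+)\s*</a\s*>
         (?P<rest>[^<]*)""",        # anything-but-a-tag after the link
    re.IGNORECASE | re.VERBOSE,
)


def numbered_links(html):
    """Return (href, number, trailing_text) for each numbered link."""
    return [
        (m.group("href").strip("\"'"), m.group("number"), m.group("rest"))
        for m in NUMBERED_LINK.finditer(html)
    ]
```

This matches both <a href=blah class=menu> and <a href=blah>, but nobody
wants to write or read it by hand, which is why a dedicated notation
looks attractive.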

This seems very useful, and not unduly complicated.  Now, why doesn't
Perl already have it?

  Ian