html parsing? Or just simple regex'ing?

Wed Nov 10 17:41:09 EST 2004

On Wed, 10 Nov 2004 12:26:04 +0100, Diez B. Roggisch wrote:

> 
>> So, I've got Basic AUTH going with http, but now I'm faced with the
>> following questions, due to the fact that I need to pull some lists out of
>> HTML, and then make some changes via POST or so, again over HTTP:
>> 
>> 1) Would I be better off just regex'ing the html I'm getting back?  (I
>> suppose this depends on the complexity of the html received, eh?)
>> 
>> 2) Would I be better off feeding the HTML into an HTML parser, and then
>> traversing that datastructure (is that really how it works?)?
> 
> I personally would certainly go that way - the best thing IMHO would be to
> make a DOM tree out of the HTML, which you can then work on with XPath. 4suite
> might be good for that. While this seems a bit overengineered at first,
> using XPath allows for pretty strong queries against your DOM tree, so even
> larger changes in the "interface" can be coped with. And writing an
> htmlparser-based class isn't hard, either.

This sounds interesting.

But if I use an XML parser to parse HTML instead of a dedicated HTML
parser, will I still get smart handling of unpaired tags?  I'm not sure we
can count on getting 100% properly formed HTML...
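
To make the worry concrete, here is a minimal sketch of the difference, not a definitive answer. It assumes a current Python, where the standard library's HTML parser lives in html.parser (in 2004 the module was named HTMLParser), and it uses xml.etree.ElementTree as a stand-in for "an XML parser". The class ListItemCollector, the helper xml_parser_accepts, and the SAMPLE markup are all made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ListItemCollector(HTMLParser):
    """Collect the text of every <li>, tolerating missing </li> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # text chunks of the <li> being read, or None

    def _flush(self):
        # Store the item collected so far, if any, and reset the buffer.
        if self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.items.append(text)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._flush()        # a new <li> implicitly closes the previous one
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._flush()        # the end of the list also closes an open <li>

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def xml_parser_accepts(markup):
    """Return True if a strict XML parser can parse the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical input: two <li> tags are never closed, and <p> is left open.
SAMPLE = "<ul><li>alpha<li>beta</li><li>gamma</ul><p>trailing"

collector = ListItemCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.items)
print(xml_parser_accepts(SAMPLE))
```

The event-based HTML parser just reports tags as it sees them, so the "smart handling" of unpaired tags is whatever the handler methods decide to do; here an opening <li> is treated as closing the previous one. The strict XML parser rejects the same input outright, which suggests that ill-formed HTML would need to be cleaned up or parsed leniently before any XPath-style querying.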