[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike
John J Lee
jjl at pobox.com
Wed Dec 3 10:40:58 EST 2003
On Wed, 3 Dec 2003, Casey Duncan wrote:
> On Wed, 3 Dec 2003 14:23:00 +0000 (GMT) John J Lee <jjl at pobox.com> wrote:
> > from tidy import tidy
> > xhtml = tidy(html)
> That would be a pretty easy wrapper methinks. At first that was pretty
> much all I thought tidylib would do, but it exposes its object model in
> such a way that you could parse HTML directly to a DOM if you wanted to.
Loss is inevitable if you're tidying. How could it be otherwise?
Usually you don't get huge DOMs from HTML documents, unlike XML, so that's
not a major problem -- I hope! Marc-Andre's page talks about poor
performance from HTMLTidy due to character-based operation, but I don't
know how severe that is or whether it's been addressed in tidylib.
4DOM seems damn slow (I may be unfairly blaming 4DOM, since I'm using a
[...] be my fault, or the fault of the JS code I'm running), but of course there
are faster, more compliant implementations, so that shouldn't be a
[...]
Finally, DOM *processing* might well be faster using tidylib just as a
tidier than it would be as a DOM (especially if you wrap the tidy-DOM to
get a real, compliant, DOM).
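
A rough sketch of the "tidy first, then hand the result to an ordinary DOM
parser" approach, in case it helps. The names tidy_then_parse and fake_tidy
are made up for illustration, and fake_tidy is only a stand-in that returns
canned output; a real tidylib wrapper would go in its place. The DOM here
comes from the standard library's xml.dom.minidom, not from tidy itself.

```python
from xml.dom import minidom


def tidy_then_parse(html, tidy):
    """Clean up HTML with `tidy`, then build a standard DOM from the result.

    `tidy` is any callable that takes HTML text and returns well-formed
    XHTML text (hypothetical interface, assumed for this sketch).
    """
    xhtml = tidy(html)
    return minidom.parseString(xhtml)


def fake_tidy(html):
    # Stand-in for a real tidy wrapper: ignores its input and returns the
    # XHTML a tidier might plausibly produce for "<p>Hello<p>World".
    return "<html><body><p>Hello</p><p>World</p></body></html>"


doc = tidy_then_parse("<p>Hello<p>World", fake_tidy)
paragraphs = [p.firstChild.data for p in doc.getElementsByTagName("p")]
```

All the DOM work after the tidy call happens in whichever DOM implementation
you pick, so the tidier's own object model never has to be wrapped.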