[Web-SIG] DOM-based templating

James Y Knight foom at fuhm.net
Fri Jun 3 18:20:07 CEST 2005


On Jun 3, 2005, at 2:18 AM, Ian Bicking wrote:
> * Can parse HTML, not just XHTML.  Not the crazy HTML browsers parse,
> but unambiguous well-formed HTML.  I don't like the idea of putting
> the HTML through tidy; that's fine for a screen-scraper, but is way
> too defensive for this kind of thing.

Ha ha. As far as I've seen, there is no Python module that can do
this. And yes, by "this" I do mean actual correct HTML, not the random
crud real browsers have to put up with. Even unambiguous well-formed
HTML is difficult to parse into a useful DOM. I'd suggest Perl -- the
Perl HTML::TreeBuilder and HTML::Tagset modules seem to get it right,
and can also parse lots of random crud.

I've been trying for a while to convince someone to copy the
algorithms from those Perl modules into twisted.web.microdom and
twisted.web.sux so that they would actually work right. microdom
purports to parse HTML into a DOM, and currently does get a lot of
stuff right, but also gets a bunch wrong. It doesn't depend on Twisted
very much, so it could make sense for someone to adopt it as a
separate package if you're into that.

Getting whitespace rules right, in particular, is quite difficult.
While the rules in the actual HTML spec are easy enough to read, they
are not what you think they are, and not what anybody has
implemented. They are much more complex than you'd expect: they
depend on the element the whitespace is near, and also seem to allow
whitespace to float into and out of elements with relative abandon.
Quick, tell me when the space in here is significant:
"<span> foo</span>"?

Just inserting all the whitespace from the original document into the  
DOM is a pretty safe thing to do, but it'd be nice to not have to do  
that, as you end up with excessive numbers of text nodes that have no  
meaning.
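
To make the "excessive numbers of text nodes" point concrete, here is a
small sketch. It uses Python's xml.dom.minidom on a well-formed
fragment purely for illustration (it is an XML parser, not an HTML
one, and not microdom); the helper name text_nodes is made up for this
example.

```python
from xml.dom import minidom

SOURCE = "<table>\n  <tr>\n    <td>Foo</td>\n  </tr>\n</table>"


def text_nodes(node):
    """Yield every text node below `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


doc = minidom.parseString(SOURCE)
texts = list(text_nodes(doc.documentElement))
# Text nodes that contain nothing but the indentation between tags.
blank = [t for t in texts if not t.data.strip()]
print(len(texts), "text nodes,", len(blank), "of them whitespace-only")
```

Keeping every run of whitespace means most of the text nodes in this
tiny table are just indentation; only one of them carries actual
content.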

Dealing with the optional closing of tags in HTML is somewhat  
irritating as well. Here's an example of a correct HTML document:  
"<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>".  
Microdom gets the table there wrong -- it has a list of which opening  
tags can close which others, but only looks at the current level, not  
up the tree, to find the tag to close. So it creates a structure  
like: table[tr[td[p["Foo", tr[td["Bar"]]]]]]. Besides the obvious  
issue of the tr being inside the p, the DOM should probably include  
inferred elements as well, such as html, head, body, and tbody.
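
A rough sketch of the "look up the tree" idea, using Python's standard
html.parser as the tokenizer. This is not the microdom or
HTML::TreeBuilder algorithm; the IMPLICITLY_CLOSES and BOUNDARIES
tables, the Node and TreeBuilder classes and the parse function are
all hypothetical and deliberately incomplete. It does not add the
inferred html, head, body or tbody elements, and it does not know
about empty elements such as br.

```python
from html.parser import HTMLParser

# Hypothetical, incomplete table: for each start tag, the open elements
# that it implicitly closes.
IMPLICITLY_CLOSES = {
    "tr": {"tr", "td", "th", "p"},
    "td": {"td", "th", "p"},
    "th": {"td", "th", "p"},
    "p": {"p"},
}
# The upward search stops at these elements, so a nested table cannot
# close cells or rows of the table that contains it.
BOUNDARIES = {"table"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances and plain text strings

    def __repr__(self):
        inner = ", ".join(
            '"%s"' % c if isinstance(c, str) else repr(c)
            for c in self.children
        )
        return "%s[%s]" % (self.tag, inner)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.open = [self.root]  # stack of open elements, innermost last

    def handle_starttag(self, tag, attrs):
        closable = IMPLICITLY_CLOSES.get(tag, set())
        cut = None
        # Walk up from the innermost open element, not just one level,
        # remembering the outermost element this start tag closes.
        for i in range(len(self.open) - 1, 0, -1):
            name = self.open[i].tag
            if name in BOUNDARIES:
                break
            if name in closable:
                cut = i
        if cut is not None:
            del self.open[cut:]
        node = Node(tag)
        self.open[-1].children.append(node)
        self.open.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element and everything inside
        # it; an end tag with no matching open element is ignored.
        for i in range(len(self.open) - 1, 0, -1):
            if self.open[i].tag == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        self.open[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


print(parse("<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>"))
```

With the example document above, the second tr closes the open p, td
and tr before it is attached, so both rows end up as children of the
table rather than the second one landing inside the p.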

On Jun 3, 2005, at 11:33 AM, Ian Bicking wrote:
> OTOH, this might all be better resolved with a Firefox extension or
> bookmarklet or somesuch, that may or may not already exist.

http://users.skynet.be/mgueury/mozilla/

James


