[Web-SIG] DOM-based templating
James Y Knight
foom at fuhm.net
Fri Jun 3 18:20:07 CEST 2005
On Jun 3, 2005, at 2:18 AM, Ian Bicking wrote:
> * Can parse HTML, not just XHTML. Not the crazy HTML browsers parse,
> but unambiguous well-formed HTML. I don't like the idea of putting the
> HTML through tidy; that's fine for a screen-scraper, but is way too
> defensive for this kind of thing.
Ha ha. As far as I've seen, there is no Python module that can do
this. And yes, by "this" I do mean actual correct HTML, not the random
crud real browsers have to put up with. Even unambiguous well-formed
HTML is difficult to parse into a useful DOM. I'd suggest Perl -- the
Perl HTML::TreeBuilder and HTML::Tagset modules seem to get it right,
and can also parse lots of random crud.
I've been trying for a while to convince someone to copy the
algorithms from those Perl modules into twisted.web.microdom and
twisted.web.sux so that they would actually work right. Microdom
purports to parse HTML into a DOM, and currently does get a lot of
stuff right, but also gets a bunch wrong. It doesn't depend on Twisted
very much, so it could make sense for someone to adopt it as a separate
package if you're into that.
Getting whitespace rules right, in particular, is quite difficult.
While the actual rules from the actual HTML spec are easy enough,
they are not what you think they are, and not what anybody has
implemented. The actual rules are much more complex, and depend on
the element the whitespace is near and also seem to allow whitespace
to float into and out of elements with relative abandon. Quick, tell
me when the space in here is significant: "<span> foo</span>"?
Just inserting all the whitespace from the original document into the
DOM is a pretty safe thing to do, but it'd be nice to not have to do
that, as you end up with excessive numbers of text nodes that have no
meaning.
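A minimal sketch of that cost, assuming Python 3's standard html.parser
module. The TextNodeCounter class and the SAMPLE string are made up for
illustration; the sketch only collects the text chunks a naive DOM
builder would turn into text nodes and makes no attempt to apply the
spec's whitespace rules.

```python
from html.parser import HTMLParser


class TextNodeCounter(HTMLParser):
    """Collects every text chunk, exactly as a keep-everything DOM would."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        self.text_nodes.append(data)


# Hypothetical input: an ordinary, indented table with one cell of content.
SAMPLE = "<table>\n  <tr>\n    <td>Foo</td>\n  </tr>\n</table>\n"

counter = TextNodeCounter()
counter.feed(SAMPLE)
counter.close()

# Text chunks that consist of nothing but whitespace.
blank = [text for text in counter.text_nodes if not text.strip()]
print(len(counter.text_nodes), "text nodes,", len(blank), "whitespace-only")
```

For indented markup like this, most of the collected text nodes are
whitespace-only, which is the clutter described above.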
Dealing with the optional closing of tags in HTML is somewhat
irritating as well. Here's an example of a correct HTML document:
"<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>".
Microdom gets the table there wrong -- it has a list of which opening
tags can close which others, but only looks at the current level, not
up the tree, to find the tag to close. So it creates a structure
like: table[tr[td[p["Foo", tr[td["Bar"]]]]]]. Besides the obvious
issue of the tr being inside the p, the DOM should probably include
inferred elements as well, such as html, head, body, and tbody.
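The "look up the tree" fix can be sketched roughly as follows, assuming
Python 3's standard html.parser module. This is not microdom's code and
not the Perl modules' algorithm; the IMPLIED_END and SCOPE_STOP tables,
the Node class, and the parse() helper are hypothetical and far from
complete. The idea is that a new start tag searches all open ancestors,
not just the innermost one, for elements it implicitly closes, and stops
at a containing element such as table. Inferring html, head, body, and
tbody is left out, and void elements such as br are not handled.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny rule tables; a real parser needs far more.
# IMPLIED_END[tag]: open elements that a new <tag> implicitly closes.
IMPLIED_END = {
    "tr": {"tr", "td", "th"},
    "td": {"td", "th"},
    "th": {"td", "th"},
    "p": {"p"},
    "li": {"li"},
}
# SCOPE_STOP[tag]: ancestors at which the upward search for <tag> stops.
SCOPE_STOP = {
    "tr": {"table"},
    "td": {"tr", "table"},
    "th": {"tr", "table"},
    "p": {"td", "th", "li", "table"},
    "li": {"ul", "ol"},
}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances or text strings

    def __repr__(self):
        return "%s%r" % (self.tag, self.children)


class TreeSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        closes = IMPLIED_END.get(tag, set())
        stops = SCOPE_STOP.get(tag, set())
        cut = None
        # Walk up from the innermost open element (never past the root),
        # remembering the outermost element this start tag should close.
        for i in range(len(self.stack) - 1, 0, -1):
            open_tag = self.stack[i].tag
            if open_tag in stops:
                break
            if open_tag in closes:
                cut = i
        if cut is not None:
            del self.stack[cut:]  # closes that element and its descendants
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element and everything inside it;
        # an end tag with no open match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    parser = TreeSketch()
    parser.feed(html)
    parser.close()
    return parser.root


root = parse("<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>")
print(root)
```

With the example document, the second tr ends up as a sibling of the
first one under table rather than nested inside the p.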
On Jun 3, 2005, at 11:33 AM, Ian Bicking wrote:
> OTOH, this might all be better resolved with a Firefox extension or
> bookmarklet or somesuch, that may or may not already exist.
http://users.skynet.be/mgueury/mozilla/
James