Help with libxml2dom

Wed Aug 19 10:10:08 EDT 2009

On 19 Aug, 13:55, Nuno Santos <nuno.hespan... at gmail.com> wrote:
> I have just started using libxml2dom to read html files and I have some
> questions I hope you guys can answer me.

[...]

>  >>> table = body.firstChild
>  >>> table.nodeName
> u'text' #?! Why!? Shouldn't it be a table? (1)

You answer this yourself just below.

>  >>> table = body.firstChild.nextSibling #why this works? is there a text element hidden? (2)
>  >>> table.nodeName
> u'table'

Yes, in the DOM, the child nodes of elements include text nodes, and
even though one might regard the whitespace before the first child
element and that appearing after the last child element as
unimportant, the DOM keeps it around in case it really is important.
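
To see this without libxml2dom installed, here is a minimal sketch using
the standard library's xml.dom.minidom as a stand-in. The markup is made
up for illustration, and minidom names text nodes '#text' where
libxml2dom reported u'text' above, so treat it as an analogy rather than
a transcript of your session:

```python
from xml.dom import minidom

# Hypothetical document, indented the way a hand-written HTML file would be.
doc = minidom.parseString(
    "<html><body>\n"
    "  <table><tr><td>cell</td></tr></table>\n"
    "</body></html>"
)
body = doc.getElementsByTagName("body")[0]

# The first child is the newline and indentation before <table>.
first = body.firstChild
is_text = first.nodeType == first.TEXT_NODE
whitespace_only = first.data.strip() == ""

# The element itself is the next sibling of that text node.
second_name = first.nextSibling.nodeName

# Whitespace before and after the element is kept as separate text nodes.
child_names = [child.nodeName for child in body.childNodes]
print(child_names)
```

The list of child names shows a text node on either side of the table
element, which is why firstChild alone does not give you the table.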

[...]

> It seems like sometimes there are some text elements 'hidden'. This is
> probably a standard in DOM I simply am not familiar with this and I
> would very much appreciate if anyone had the kindness to explain me this.

Well, the nodes are actually there: they're whitespace used to provide
the indentation in your example. I recommend using XPath to get actual
elements:

# get the child elements, then select the first
table = body.xpath("*")[0]

Although people make a big "song and dance" about the DOM being a
nasty API, it's quite bearable if you use it together with XPath
queries.
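
The same idea can be tried with only the standard library. This is a
sketch using ElementTree's limited XPath support rather than libxml2dom,
and the markup is again made up; the point is just that a "*" query
yields child elements and skips the whitespace between them:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with indentation between the child elements.
body = ET.fromstring(
    "<body>\n"
    "  <table><tr><td>cell</td></tr></table>\n"
    "  <p>after</p>\n"
    "</body>"
)

# "*" matches every child element; whitespace is never returned.
elements = body.findall("*")
table = elements[0]
tags = [element.tag for element in elements]
print(table.tag, tags)
```

Note that ElementTree is not a DOM implementation: it stores whitespace
in the text and tail attributes instead of as sibling nodes, so it only
illustrates the query, not the node structure discussed above.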

Paul