Help with libxml2dom

Wed Aug 19 08:45:13 EDT 2009

Nuno Santos wrote:

> I have just started using libxml2dom to read HTML files and I have some
> questions I hope you guys can answer.
> 
> The page I am working on (teste.htm):
> <html>
>   <head>
>     <title>
>       Title
>     </title>
>   </head>
>   <body bgcolor = 'FFFFF'>
>     <table>
>       <tr bgcolor="#EEEEEE">
>         <td nowrap="nowrap">
>           <font size="2" face="Tahoma, Arial"> <a name="1375048"></a>
> </font>
>         </td>
>         <td nowrap="nowrap">
>           <font size="-2" face="Verdana"> 8/15/2009</font>
>         </td>
>       </tr>
>     </table>
>   </body>
> </html>
> 
>  >>> import libxml2dom
>  >>> foo = open('teste.htm', 'r')
>  >>> str1 = foo.read()
>  >>> doc = libxml2dom.parseString(str1, html=1)
>  >>> html = doc.firstChild
>  >>> html.nodeName
> u'html'
>  >>> head = html.firstChild
>  >>> head.nodeName
> u'head'
>  >>> title = head.firstChild
>  >>> title.nodeName
> u'title'
>  >>> body = head.nextSibling
>  >>> body.nodeName
> u'body'
>  >>> table = body.firstChild
>  >>> table.nodeName
> u'text' #?! Why!? Shouldn't it be a table? (1)
>  >>> table = body.firstChild.nextSibling #why this works? is there a
> text element hidden? (2)
>  >>> table.nodeName
> u'table'
>  >>> tr = table.firstChild
>  >>> tr.nodeName
> u'tr'
>  >>> td = tr.firstChild
>  >>> td.nodeName
> u'td'
>  >>> font = td.firstChild
>  >>> font.nodeName
> u'text' # (1)
>  >>> font = td.firstChild.nextSibling # (2)
>  >>> font.nodeName
> u'font'
>  >>> a = font.firstChild
>  >>> a.nodeName
> u'text' #(1)
>  >>> a = font.firstChild.nextSibling #(2)
>  >>> a.nodeName
> u'a'
> 
> 
> It seems like sometimes there are some 'hidden' text elements. This is
> probably standard DOM behaviour that I am simply not familiar with, and I
> would very much appreciate it if anyone would be kind enough to explain it.

Without a schema or something similar, a parser can't tell if whitespace is
significant or not. So if you have 

<root>
  <child/>
</root>

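you will have not 2, but 4 nodes: root, a text node containing a newline + 2
spaces, child, and another text node containing a newline. The u'text' nodes
in your session are exactly this kind of whitespace between the tags.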
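
To see the four nodes for yourself, here is a minimal sketch. It assumes the
standard library's xml.dom.minidom instead of libxml2dom (so that it runs
without extra packages); minidom names text nodes '#text' where your
libxml2dom session shows u'text', but the tree shape is the same.

```python
from xml.dom import minidom

# The three-line example from above, including its newlines and indentation.
SOURCE = "<root>\n  <child/>\n</root>"

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# root has three children: whitespace text, <child/>, whitespace text.
child_names = [node.nodeName for node in root.childNodes]
print(child_names)

# The whitespace itself is available as the .data of each text node.
leading = root.firstChild.data
trailing = root.lastChild.data
print(repr(leading), repr(trailing))
```

Counting root itself plus its three children gives the 4 nodes.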

You have to skip over the nodes you are not interested in, or use a
different XML library such as ElementTree (e.g. in the form of lxml), which
stores text as attributes of the elements rather than as separate nodes.
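
Both options can be sketched roughly like this. Again this assumes
xml.dom.minidom as a stand-in for libxml2dom, and the standard library's
xml.etree.ElementTree rather than lxml; element_children is just a
hypothetical helper name, not something either library provides.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

SOURCE = "<root>\n  <child/>\n</root>"


def element_children(node):
    """Return only the element children of a DOM node, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


# Option 1: stay with the DOM and filter out everything that is not an element.
dom_root = minidom.parseString(SOURCE).documentElement
dom_names = [c.nodeName for c in element_children(dom_root)]

# Option 2: ElementTree. Iterating an element yields only child elements;
# the whitespace lives in the .text and .tail attributes instead.
et_root = ET.fromstring(SOURCE)
et_names = [c.tag for c in et_root]
et_text = et_root.text      # whitespace between <root> and <child/>
et_tail = et_root[0].tail   # whitespace between <child/> and </root>

print(dom_names, et_names, repr(et_text), repr(et_tail))
```

With a helper like this, element_children(body)[0] would give the table
directly instead of body.firstChild.nextSibling.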

Diez