Help with libxml2dom
Diez B. Roggisch
deets at nospam.web.de
Wed Aug 19 08:45:13 EDT 2009
Nuno Santos wrote:
> I have just started using libxml2dom to read HTML files and I have some
> questions I hope you can answer.
>
> The page I am working on (teste.htm):
> <html>
> <head>
> <title>
> Title
> </title>
> </head>
> <body bgcolor = 'FFFFF'>
> <table>
> <tr bgcolor="#EEEEEE">
> <td nowrap="nowrap">
> <font size="2" face="Tahoma, Arial"> <a name="1375048"></a>
> </font>
> </td>
> <td nowrap="nowrap">
> <font size="-2" face="Verdana"> 8/15/2009</font>
> </td>
> </tr>
> </table>
> </body>
> </html>
>
> >>> import libxml2dom
> >>> foo = open('teste.htm', 'r')
> >>> str1 = foo.read()
> >>> doc = libxml2dom.parseString(str1, html=1)
> >>> html = doc.firstChild
> >>> html.nodeName
> u'html'
> >>> head = html.firstChild
> >>> head.nodeName
> u'head'
> >>> title = head.firstChild
> >>> title.nodeName
> u'title'
> >>> body = head.nextSibling
> >>> body.nodeName
> u'body'
> >>> table = body.firstChild
> >>> table.nodeName
> u'text' #?! Why!? Shouldn't it be a table? (1)
> >>> table = body.firstChild.nextSibling  # Why does this work? Is there a hidden text element? (2)
> >>> table.nodeName
> u'table'
> >>> tr = table.firstChild
> >>> tr.nodeName
> u'tr'
> >>> td = tr.firstChild
> >>> td.nodeName
> u'td'
> >>> font = td.firstChild
> >>> font.nodeName
> u'text' # (1)
> >>> font = td.firstChild.nextSibling # (2)
> >>> font.nodeName
> u'font'
> >>> a = font.firstChild
> >>> a.nodeName
> u'text' #(1)
> >>> a = font.firstChild.nextSibling #(2)
> >>> a.nodeName
> u'a'
>
>
> It seems that there are sometimes 'hidden' text elements. This is
> probably standard DOM behaviour that I am simply not familiar with, and I
> would very much appreciate it if someone could explain it to me.
Without a schema or something similar, a parser can't tell if whitespace is
significant or not. So if you have
<root>
  <child/>
</root>
you will have not 2 but 4 nodes: root, a text node containing a newline + 2
spaces, child, and another text node containing a newline.
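To see this for yourself, here is a minimal sketch. It uses xml.dom.minidom
from the standard library instead of libxml2dom (my substitution, so that it
runs without extra packages); the node counting is the same DOM behaviour.

```python
from xml.dom import minidom

# The small document from above, with the child indented by 2 spaces.
doc = minidom.parseString("<root>\n  <child/>\n</root>")
root = doc.documentElement

# List every direct child of root with its name and text value.
# minidom reports text nodes as '#text' (libxml2dom showed u'text').
kinds = [(node.nodeName, node.nodeValue) for node in root.childNodes]
print(kinds)
# [('#text', '\n  '), ('child', None), ('#text', '\n')]
```

So root has three children, which together with root itself gives the 4
nodes.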
You have to skip over the nodes that you are not interested in, or use a
different XML library such as ElementTree (e.g. in the form of lxml) that
takes a different approach to text nodes.
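Both options can be sketched roughly as follows. This again uses the standard
library (xml.dom.minidom and xml.etree.ElementTree) rather than libxml2dom or
lxml, and element_children is a hypothetical helper name of mine, not part of
any library. I assume libxml2dom nodes expose the usual DOM nodeType
attribute in the same way, but I have not checked that here.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = "<root>\n  <child/>\n</root>"

def element_children(node):
    """Return only the element children of node, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

# Option 1: stay with the DOM and filter out what you don't want.
dom_root = minidom.parseString(XML).documentElement
dom_names = [c.nodeName for c in element_children(dom_root)]
print(dom_names)   # ['child']

# Option 2: ElementTree has no text nodes at all. Iterating an element
# yields only child elements; whitespace ends up in .text and .tail.
et_root = ET.fromstring(XML)
et_tags = [c.tag for c in et_root]
print(et_tags)                 # ['child']
print(repr(et_root.text))      # '\n  '
print(repr(et_root[0].tail))   # '\n'
```

With the filter, element_children(body)[0] would give you the table directly
instead of going through firstChild.nextSibling.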
Diez