Help with libxml2dom
Nuno Santos
nuno.hespanhol at gmail.com
Wed Aug 19 07:55:02 EDT 2009
I have just started using libxml2dom to read HTML files, and I have some
questions I hope you can answer.
The page I am working on (teste.htm):
<html>
<head>
<title>
Title
</title>
</head>
<body bgcolor = 'FFFFF'>
<table>
<tr bgcolor="#EEEEEE">
<td nowrap="nowrap">
<font size="2" face="Tahoma, Arial"> <a name="1375048"></a>
</font>
</td>
<td nowrap="nowrap">
<font size="-2" face="Verdana"> 8/15/2009</font>
</td>
</tr>
</table>
</body>
</html>
>>> import libxml2dom
>>> foo = open('teste.htm', 'r')
>>> str1 = foo.read()
>>> doc = libxml2dom.parseString(str1, html=1)
>>> html = doc.firstChild
>>> html.nodeName
u'html'
>>> head = html.firstChild
>>> head.nodeName
u'head'
>>> title = head.firstChild
>>> title.nodeName
u'title'
>>> body = head.nextSibling
>>> body.nodeName
u'body'
>>> table = body.firstChild
>>> table.nodeName
u'text' #?! Why!? Shouldn't it be a table? (1)
>>> table = body.firstChild.nextSibling # Why does this work? Is there a hidden text element? (2)
>>> table.nodeName
u'table'
>>> tr = table.firstChild
>>> tr.nodeName
u'tr'
>>> td = tr.firstChild
>>> td.nodeName
u'td'
>>> font = td.firstChild
>>> font.nodeName
u'text' # (1)
>>> font = td.firstChild.nextSibling # (2)
>>> font.nodeName
u'font'
>>> a = font.firstChild
>>> a.nodeName
u'text' #(1)
>>> a = font.firstChild.nextSibling #(2)
>>> a.nodeName
u'a'
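The same effect can be reproduced without libxml2dom. What follows is only a minimal sketch, assuming that the standard library's xml.dom.minidom treats whitespace between tags the same way libxml2dom does here. The SNIPPET string and the helper name element_children are made up for illustration and are not part of either library.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# A cut-down, well-formed version of the page above (illustrative only).
SNIPPET = "<body>\n<table>\n<tr><td>8/15/2009</td></tr>\n</table>\n</body>"


def element_children(node):
    """Return only the element children of node, skipping text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


doc = parseString(SNIPPET)
body = doc.documentElement

# The line break between <body> and <table> is stored as a text node,
# so firstChild is that text node rather than the table.
first = body.firstChild
print(first.nodeName, repr(first.data))

# Filtering on nodeType reaches the table without counting siblings by hand.
table = element_children(body)[0]
print(table.nodeName)
```

Note that minidom names the whitespace node '#text', whereas the libxml2dom session above reports 'text'. In both cases the node appears to hold only the whitespace between the tags.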
It seems that there are sometimes 'hidden' text elements. This is
probably standard DOM behaviour that I am simply not familiar with, and I
would very much appreciate it if someone could explain it to me.
Thanks.