Simple HTML to XML parser?
Kornelis Sietsma
korny at sietsma.com
Tue Nov 14 21:47:15 EST 2000
Alex Martelli wrote:
><jsantaniello at my-deja.com> wrote in message
>news:8uq132$rek$1 at nnrp1.deja.com...
>> Hi Everyone,
>>
>> Does anyone have or know of a simple HTML to XML parser? The sax package
>
>A simple HTML parser is module htmllib, but I'm not sure what
>you mean by "HTML to XML parser". What is the "to XML" part of
>this request? Do you mean an "HTML to XML translator", or...?
>
>> is too much for me to handle. What I'm looking for is the ability to
>> grab some html with urllib for example and then access an object like:
>>
>> page = urlopen(url)
>> the_value = page.body.form[0].hidden_element_name.value
>
>Ah, you'd like to get the DOM of an HTML document. I'm
>not sure the XML DOM implementations in Python support
>HTML DOMs too, though...
<snip>
Disclaimer: I'm a Python newbie, so there are probably many better ways to
do this...

The standard XML DOM model, as described in the XML Howto at
http://www.python.org/doc/howto/xml/xml-howto.html, handles basic HTML with
no problems. For some reason the current online howto seems to have lost
this information, but the copy I printed out a couple of weeks ago gives
this example (from section 4.1):
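
import urllib
from xml.dom.utils import FileReader  # the reader class used by the howto example

URL = 'http://localhost//index.html'
sock = urllib.urlopen(URL)            # open the page as a file-like object
f = FileReader()
doc = f.readFile('index.html', sock)  # read the stream into a DOM document
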
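
For readers without xml.dom.utils installed, a rough equivalent can be sketched with xml.dom.minidom from the standard library. This is only a sketch under assumptions: minidom accepts well-formed XML only, so the page has to be XHTML-style rather than ordinary tag-soup HTML. SAMPLE and load_document are made-up names for illustration, and the BytesIO object stands in for the file-like object that urlopen would return.

```python
import io
from xml.dom import minidom

# Hypothetical sample page; it must be well-formed XML for minidom to accept it.
SAMPLE = b"<html><head><title>Test</title></head><body><p>Hello</p></body></html>"

def load_document(stream):
    """Parse a file-like object holding well-formed (X)HTML into a DOM document."""
    return minidom.parse(stream)

# io.BytesIO stands in for the socket returned by urlopen(URL).
doc = load_document(io.BytesIO(SAMPLE))
root_name = doc.documentElement.tagName
print(root_name)
```

With a live page, the stream would be whatever urlopen returns (urllib.request.urlopen in Python 3), provided the server sends well-formed markup.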
After this, doc is a standard DOM object, so you have to traverse the tree
to get to anything. For example, doc.firstChild is the main <HTML> node,
doc.firstChild.firstChild is the <head> node, doc.firstChild.childNodes[1]
is the <body> node, and so on. (The exact indexes depend on whether the
parser keeps whitespace-only text nodes between elements.)
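
The traversal described here, applied to the original question about reading a hidden form field, might look like the sketch below. It assumes xml.dom.minidom from the standard library and a well-formed page; SAMPLE, the field name "token", and the helpers element_children and hidden_value are all hypothetical names made up for the example. Because minidom keeps whitespace between tags as text nodes, the helper filters for element nodes instead of relying on fixed child indexes.

```python
from xml.dom import minidom

# Hypothetical well-formed page containing one hidden form field.
SAMPLE = """<html>
  <head><title>Login</title></head>
  <body>
    <form action="/login">
      <input type="hidden" name="token" value="abc123"/>
    </form>
  </body>
</html>"""

def element_children(node):
    """Return only the element children, skipping whitespace text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def hidden_value(doc, field_name):
    """Return the value of the named hidden <input>, or None if it is absent."""
    for inp in doc.getElementsByTagName("input"):
        if inp.getAttribute("type") == "hidden" and inp.getAttribute("name") == field_name:
            return inp.getAttribute("value")
    return None

doc = minidom.parseString(SAMPLE)
html = doc.documentElement            # the <html> element
head, body = element_children(html)   # <head> and <body>, whitespace skipped
the_value = hidden_value(doc, "token")
print(head.tagName, body.tagName, the_value)
```

Searching with getElementsByTagName is less fragile than walking firstChild and childNodes by position, since it does not care how the page is indented.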
-Korny
--
Kornelis Sietsma http://www.sietsma.com/korny korny at sietsma.com