Simple HTML to XML parser?

Tue Nov 14 21:47:15 EST 2000

Alex Martelli wrote:

><jsantaniello at my-deja.com> wrote in message
>news:8uq132$rek$1 at nnrp1.deja.com...
>> Hi Everyone,
>>
>> Does anyone have or know of a simple HTML to XML parser? The sax package
>
>A simple HTML parser is module htmllib, but I'm not sure what
>you mean by "HTML to XML parser".  What is the "to XML" part of
>this request...?  You mean "HTML to XML translator", or...?
>
>> is too much for me to handle. What I'm looking for is the ability to
>> grab some html with urllib for example and then access an object like:
>>
>> page = urlopen(url)
>> the_value = page.body.form[0].hidden_element_name.value
>
>Ah, you'd like to get the DOM of an HTML document.  I'm
>not sure the XML DOM implementations in Python support
>HTML DOMs too, though...

<snip>

Disclaimer: I'm a Python newbie, so there are probably many better ways to
do this...

The standard XML DOM model, as described in the XML Howto at
http://www.python.org/doc/howto/xml/xml-howto.html, handles basic HTML with
no problems.  For some strange reason, the current version of the howto
seems to have lost this information, but the copy I printed out a couple of
weeks ago gives this example (from section 4.1):

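from xml.dom.utils import FileReader
import urllib

URL = 'http://localhost/index.html'
# urlopen returns a file-like object for the page
sock = urllib.urlopen(URL)
# FileReader reads the stream and builds a DOM tree from it
f = FileReader()
doc = f.readFile('index.html', sock)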

After this, doc is a standard DOM object, so you have to traverse the tree
to get to anything.  For example, doc.firstChild is the top-level <html>
node, doc.firstChild.firstChild is the <head> node,
doc.firstChild.childNodes[1] is the <body> node, and so on.
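
Here is a rough sketch of that kind of traversal, aimed at the original
question of pulling a hidden form value out of a page.  It is not the howto's
code: it assumes the standard library's xml.dom.minidom instead of
FileReader, and minidom only accepts well-formed markup, which much
real-world HTML is not.  The sample page, the field name 'token' and the
helper hidden_value are all made up for illustration.  The sample has no
whitespace between tags; with whitespace, the parser would add text nodes
and the child indexes would shift.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed sample page (minidom rejects tag-soup HTML).
PAGE = (
    '<html><head><title>Demo</title></head>'
    '<body><form action="/go">'
    '<input type="hidden" name="token" value="abc123"/>'
    '</form></body></html>'
)

doc = parseString(PAGE)

# Walk the tree by position, as described above.
html = doc.firstChild        # the <html> element
head = html.firstChild       # the <head> element
body = html.childNodes[1]    # the <body> element


def hidden_value(node, name):
    """Return the value of the hidden <input> called `name`, or None."""
    for inp in node.getElementsByTagName('input'):
        if (inp.getAttribute('type') == 'hidden'
                and inp.getAttribute('name') == name):
            return inp.getAttribute('value')
    return None


# Searching by tag name avoids counting child positions by hand.
the_value = hidden_value(body, 'token')
```

Looking elements up with getElementsByTagName is usually less fragile than
indexing into childNodes, since it does not depend on the exact layout of
the page.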

-Korny
--
Kornelis Sietsma   http://www.sietsma.com/korny  korny at sietsma.com