[Tutor] xml parsing [an introduction to xml.dom.minidom]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu Nov 21 15:41:59 2002


On Thu, 21 Nov 2002 alan.gauld@bt.com wrote:

> > little bit of documentation of the re module, I think I may
> > be able to figure out how to do this for the general situation
> > if I can come to understand the re module somewhat correctly.
>
> Sorry, but you won't, regular expressions are the wrong tool for working
> with any serious level of XML. They don't handle recursive definitions
> at all well, and will get you tied up in ever increasing knots.

Hello!


Here's an example that shows how to parse that XML using minidom, a module
that implements the "Document Object Model" system.  First, let's say that
we're working with the following xml string:

###
>>> xmlstring = """
... <main>
... <block1><item1>1.0</item1><item2>1.234</item2></block1>
... <block2><item1>6.4</item1><item2>4</item2></block2>
... </main>"""
###

The example string above is slightly different from your original XML
string, because XML documents need to be wrapped in a main 'document
element' block.  I've put your XML fragments in an artificial 'main' block
just so that the parser's happy.
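
To see why the wrapper matters, here's a small sketch of what happens without it.  It's written for a current Python 3, so it looks a bit different from the interactive sessions in this message, and the names 'fragment' and 'wrapped' are just mine for illustration:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Two top-level elements with no single enclosing element.
fragment = ("<block1><item1>1.0</item1></block1>"
            "<block2><item1>6.4</item1></block2>")

try:
    xml.dom.minidom.parseString(fragment)
    parsed_ok = True
except ExpatError as err:
    # The parser objects to anything after the first top-level element closes.
    parsed_ok = False
    print("parse failed:", err)

# Wrapping the same fragment in one enclosing element makes it well-formed.
wrapped = "<main>" + fragment + "</main>"
dom = xml.dom.minidom.parseString(wrapped)
print(dom.documentElement.tagName)
```

The bare fragment should be rejected, while the wrapped version parses and reports 'main' as its document element.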


The xml.dom.minidom module, described at:

    http://www.python.org/doc/current/lib/module-xml.dom.minidom.html

gives us a quick-and-dirty way to parse our XML.  How do we build a DOM
from our string?

###
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString(xmlstring)
>>> dom
<xml.dom.minidom.Document instance at 0x827dc0c>
>>> dom.documentElement.tagName
u'main'
###

The parseString() function gives us back a "document" object that has a
few methods we can use.  One of the important ones for diving into the
tree is 'getElementsByTagName', which returns to us a list of "nodes":


###
>>> blocks = dom.getElementsByTagName('block1')
>>> blocks
[<DOM Element: block1 at 136832988>]
>>> first_block = blocks[0]
>>> first_block.childNodes
[<DOM Element: item1 at 136829852>, <DOM Element: item2 at 136835484>]
###

Using these methods, we find ourselves diving deeper into our tree.
There's item1 and item2!  And these inner nodes, too, have children.  The
DOM allows us to dive into the structure of an XML file.
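
One thing to watch for: block1 has no whitespace between its tags, so its childNodes list holds only the two item elements.  If the XML is indented, minidom keeps that whitespace as text nodes too.  Here's a rough sketch (Python 3 again) of filtering them out; 'element_children' is a hypothetical helper of my own, not something minidom provides:

```python
import xml.dom.minidom
from xml.dom import Node

# The same block1 as before, but indented across several lines.
pretty = """<block1>
  <item1>1.0</item1>
  <item2>1.234</item2>
</block1>"""

def element_children(node):
    """Return only the children that are elements, skipping text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]

block = xml.dom.minidom.parseString(pretty).documentElement

# childNodes now includes the whitespace between the tags as text nodes.
print(len(block.childNodes))
print([child.tagName for child in element_children(block)])
```

Checking nodeType like this is one simple way to keep 'childNodes[0]' or 'firstChild' from landing on a newline by surprise.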



Let's dive in a bit deeper:

###
>>> first_block.childNodes[0].childNodes[0]
<DOM Text node "1.0">
###

We finally reach into one of the textual values in there.  Oh, by the way,
as a convenience, we can access the first child of any "node" by using its
'firstChild' attribute, so we can say:

###
>>> first_block.firstChild.firstChild
<DOM Text node "1.0">
###

So we have this text node... but it's still a node.  How do we really
extract the text from it?  We can grab the text off of a "text node"  by
grabbing its 'data' attribute:

###
>>> first_block.firstChild.firstChild.data
u'1.0'
###
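
The 'firstChild.data' trick works when an element holds exactly one piece of text.  As a slightly more defensive sketch (Python 3), here is a small helper that joins whatever text nodes sit directly under an element.  The name 'get_text' and the extra '<empty/>' element are my own additions for illustration:

```python
import xml.dom.minidom
from xml.dom import Node

def get_text(element):
    """Join the data of every text node directly under element."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == Node.TEXT_NODE)

dom = xml.dom.minidom.parseString(
    "<block1><item1>1.0</item1><item2>1.234</item2><empty/></block1>")
item1 = dom.getElementsByTagName("item1")[0]
empty = dom.getElementsByTagName("empty")[0]

print(repr(get_text(item1)))
# An empty element has no children at all, so its firstChild is None
# and 'empty.firstChild.data' would raise an AttributeError.
print(repr(get_text(empty)))

# The data is always a string; convert it yourself if you want a number.
value = float(get_text(item1))
print(value)
```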





Whew!  Let's look back at our original XML:

###
<main>
<block1><item1>1.0</item1><item2>1.234</item2></block1>
<block2><item1>6.4</item1><item2>4</item2></block2>
</main>
###


Let's say that we're interested in the block2/item2 text value.  How can
we grab at that one?  Try it out, and then look below.

...
...
...
...
...
...
...


Ok, time's up.  *grin* Here's one way to do it:

###
>>> dom.getElementsByTagName('block2')[0]\
...     .getElementsByTagName('item2')[0].firstChild.data
u'4'
>>>
>>> for node in dom.getElementsByTagName('item2'):
...     print node.parentNode
...
<DOM Element: block1 at 136832988>
<DOM Element: block2 at 136835076>
>>> dom.getElementsByTagName('item2')[1].firstChild.data
u'4'
###



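Alternatively, we could have gone through it methodically, one node at a
time.  (A caveat: the session below assumes there's no whitespace between
the tags.  With the newlines in xmlstring as typed above, minidom also
stores each newline as a text node, so firstChild and nextSibling would
land on those whitespace nodes first.)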

###
>>> dom.getElementsByTagName('main')[0].firstChild
<DOM Element: block1 at 136832988>
>>>
>>> dom.getElementsByTagName('main')[0].firstChild.nextSibling
<DOM Element: block2 at 136835076>
>>>
>>> dom.getElementsByTagName('main')[0].firstChild\
...     .nextSibling.firstChild
<DOM Element: item1 at 136836580>
>>>
>>> dom.getElementsByTagName('main')[0].firstChild\
...     .nextSibling.firstChild.nextSibling
<DOM Element: item2 at 136838772>
>>>
>>> dom.getElementsByTagName('main')[0].firstChild\
...     .nextSibling.firstChild.nextSibling.firstChild.data
u'4'
###

But since we know exactly what the tag name is, we can use
getElementsByTagName() without having to manually traverse the whole tree,
node by node.
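
If you find yourself chaining getElementsByTagName() calls a lot, one option is to wrap the chain in a little function.  This is only a sketch (Python 3), and 'find_text' is a hypothetical helper of mine, not part of minidom:

```python
import xml.dom.minidom
from xml.dom import Node

xmlstring = """
<main>
<block1><item1>1.0</item1><item2>1.234</item2></block1>
<block2><item1>6.4</item1><item2>4</item2></block2>
</main>"""

def find_text(node, *tag_names):
    """Follow tag_names downward, taking the first match at each step.

    Returns the text of the last element found, or None if a tag is
    missing or the element does not start with a text node.
    """
    for name in tag_names:
        # getElementsByTagName() searches all descendants, not just children.
        matches = node.getElementsByTagName(name)
        if not matches:
            return None
        node = matches[0]
    first = node.firstChild
    if first is not None and first.nodeType == Node.TEXT_NODE:
        return first.data
    return None

dom = xml.dom.minidom.parseString(xmlstring)
print(find_text(dom, "block2", "item2"))
print(find_text(dom, "block2", "item3"))
```

Because each step searches by tag name, the newlines between the tags in xmlstring don't get in the way here.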



I hope this helps make the DOM a little easier to understand.  There's
more documentation about the methods in a DOM document at:

    http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html

It's a little... umm... wordy.  But there really is some useful stuff in
there.  I'd recommend skipping to section 1.2 and looking at the "Interface
Document" and "Interface Node" sections.  Playing around with the DOM in
the interactive interpreter will also help as you probe your document's
attributes.



Good luck!