[Tutor] Parsing a block of XML text

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sun Jan 2 08:31:45 CET 2005



On Fri, 31 Dec 2004, kumar s wrote:

> http://www.python.org/doc/lib/dom-example.html
>
> Frankly it looked more complex. could I request you to explain your
> pseudocode. It is confusing when you say call a function within another
> function.


Hi Kumar,

A question, though: can you try to explain what part feels weird about
having a function call another function?

Is it something specific to XML processing, or a more general problem?
That is, do you already feel comfortable with writing and using "helper"
functions?

If you're feeling uncomfortable with the idea of functions calling
functions, then that's something we should probably concentrate on,
because it's really crucial to use this technique, especially on
structured data like XML.



As a concrete toy example of a function that calls another function, we
can use the overused hypotenuse function.  Given right triangle leg
lengths 'a' and 'b', this function returns the length of the hypotenuse:

###
def hypotenuse(a, b):
    return (a**2 + b**2)**0.5
###

This definition works, but we can use helper functions to make the
hypotenuse function a little bit more like English:

###
def sqrt(x):
    return x ** 0.5

def square(x):
    return x * x

def hypotenuse(a, b):
    return sqrt(square(a) + square(b))
###

In this variation, the rewritten hypotenuse() function uses the other two
functions as "helpers".  The key idea is that the functions we write can
then be used by anything that needs them.

Another thing that happens is that hypotenuse() doesn't have to know how
sqrt() and square()  are defined: it just depends on the fact that sqrt()
and square() are out there, and it can just use these as tools.  Computer
scientists call this "abstraction".
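
To make that concrete, here's a small sketch (my own variation on the code
above): we change how sqrt() is implemented by handing the work to the
standard library's math.sqrt(), and hypotenuse() keeps working without
being touched at all.

```python
import math

def sqrt(x):
    # A different implementation than before: delegate to the standard library.
    return math.sqrt(x)

def square(x):
    return x * x

def hypotenuse(a, b):
    # Unchanged: it only relies on sqrt() and square() being available.
    return sqrt(square(a) + square(b))

print(hypotenuse(3, 4))   # prints 5.0
```

The caller never had to learn that sqrt() changed underneath it.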



Here is another "helper" function, one that comes in handy when we do XML
parsing:

###
def get_children(node, tagName):
    """Returns the child elements of the node that have this particular
    tagName.  This is different from getElementsByTagName() because we
    only look shallowly at the immediate children of the given node."""
    children = []
    for n in node.childNodes:
        if n.nodeType == n.ELEMENT_NODE and n.tagName == tagName:
            children.append(n)
    return children
###


For example:

###
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString("<p><a>hello</a><a>world</a></p>")
>>>
>>> dom.firstChild
<DOM Element: p at 0x50efa8>
>>>
>>> get_children(dom.firstChild, "a")
[<DOM Element: a at 0x50efd0>, <DOM Element: a at 0x516058>]
###
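
To see the "shallow" part in action, here's a sketch with a slightly
different document, one I made up so that one <a> is nested a level deeper
inside a <b>.  get_children() only finds the direct child, while
getElementsByTagName() finds both.

```python
import xml.dom.minidom

def get_children(node, tagName):
    """Returns the child elements of node that have this tagName,
    looking only at the immediate children."""
    children = []
    for n in node.childNodes:
        if n.nodeType == n.ELEMENT_NODE and n.tagName == tagName:
            children.append(n)
    return children

# Made-up sample: the second <a> sits inside <b>, not directly under <p>.
dom = xml.dom.minidom.parseString("<p><a>hello</a><b><a>nested</a></b></p>")
p = dom.documentElement

print(len(get_children(p, "a")))           # prints 1: direct children only
print(len(p.getElementsByTagName("a")))    # prints 2: searches all descendants
```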




> It is confusing when you say call a function within another function.

Here's a particular example that uses this get_children() function together
with the get_text() function that we used in the earlier part of this thread.

###
def parse_Hsp(hsp_node):
    """Prints out the query-from and query-to of an Hsp node."""
    query_from = get_text(get_children(hsp_node, "Hsp_query-from")[0])
    query_to = get_text(get_children(hsp_node, "Hsp_query-to")[0])
    print(query_from)
    print(query_to)
###

This function only knows how to deal with Hsp elements.  As soon as we can
dive through our DOM tree into an Hsp element, we should be able to
extract the data we need.  Does this definition of parse_Hsp() make sense?
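
Since get_text() isn't repeated in this message, here is a minimal sketch
of it.  I'm assuming it simply glues together the text nodes sitting
directly under an element; the version from earlier in the thread may
differ in its details.

```python
import xml.dom.minidom

def get_text(node):
    """Assumed stand-in for the get_text() from earlier in the thread:
    joins the data of the text nodes directly under this node."""
    pieces = []
    for n in node.childNodes:
        if n.nodeType == n.TEXT_NODE:
            pieces.append(n.data)
    return "".join(pieces)

dom = xml.dom.minidom.parseString("<a>hello</a>")
print(get_text(dom.documentElement))   # prints hello
```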


You're not going to be able to use it immediately for your real problem
yet, but you can try it on a sample subset of your XML data:

###
sampleData = """
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>1164.13</Hsp_bit-score>
<Hsp_score>587</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>587</Hsp_query-to>
</Hsp>
"""
doc = xml.dom.minidom.parseString(sampleData)
parse_Hsp(doc.firstChild)
###

to see how it works so far.
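
Once parse_Hsp() works on one Hsp, the same helpers let us reuse it on
several.  Here's a sketch under a few assumptions: the <Hsps> wrapper
element and the second Hsp's values are made up for illustration (your
real file will nest things differently), and get_text() is my guess at
the version from earlier in the thread.

```python
import xml.dom.minidom

def get_children(node, tagName):
    # Same shallow lookup as before, written as a list comprehension.
    return [n for n in node.childNodes
            if n.nodeType == n.ELEMENT_NODE and n.tagName == tagName]

def get_text(node):
    # Assumed behavior of the get_text() from earlier in the thread.
    return "".join(n.data for n in node.childNodes
                   if n.nodeType == n.TEXT_NODE)

def parse_Hsp(hsp_node):
    """Prints out the query-from and query-to of an Hsp node."""
    query_from = get_text(get_children(hsp_node, "Hsp_query-from")[0])
    query_to = get_text(get_children(hsp_node, "Hsp_query-to")[0])
    print(query_from)
    print(query_to)

# Hypothetical wrapper element and made-up values, for illustration only.
manyHsps = """
<Hsps>
<Hsp><Hsp_query-from>1</Hsp_query-from><Hsp_query-to>587</Hsp_query-to></Hsp>
<Hsp><Hsp_query-from>600</Hsp_query-from><Hsp_query-to>742</Hsp_query-to></Hsp>
</Hsps>
"""
doc = xml.dom.minidom.parseString(manyHsps)
hsp_nodes = get_children(doc.documentElement, "Hsp")
for hsp_node in hsp_nodes:
    parse_Hsp(hsp_node)
```

The loop doesn't need to know anything about what's inside an Hsp; that
knowledge stays tucked away in parse_Hsp().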




This example tries to show the power of being able to call helper
functions.  If we were to try to write this all using DOM primitives, the
end result would look too ugly for words.  But let's see it anyway.
*grin*

###
def parse_Hsp(hsp_node):  ## without using helper functions:
    """Prints out the query-from and query-to of an Hsp node."""
    query_from, query_to = "", ""
    for child in hsp_node.childNodes:
        if (child.nodeType == child.ELEMENT_NODE and
            child.tagName == "Hsp_query-from"):
            for n in child.childNodes:
                if n.nodeType == n.TEXT_NODE:
                    query_from += n.data
        if (child.nodeType == child.ELEMENT_NODE and
            child.tagName == "Hsp_query-to"):
            for n in child.childNodes:
                if n.nodeType == n.TEXT_NODE:
                    query_to += n.data
    print(query_from)
    print(query_to)
###

This is exactly the kind of code we want to avoid.  It works, but it's so
fragile and hard to read that I just don't trust it.  It just burns my
eyes.  *grin*

By using "helper" functions, we're extending Python's vocabulary of
commands.  We can then use those functions to help solve our problem with
less silliness.  This is a reason why knowing how to write and use
functions is key to learning how to program: this principle applies
regardless of what particular programming language we're using.


If you have questions on any of this, please feel free to ask.  Good luck!


