[Tutor] Python & XML

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Tue May 11 18:41:21 EDT 2004



On Tue, 11 May 2004 justinstraube at charter.net wrote:

> Sorry, for the repost but I accidentally sent before I could ask for a
> small bit of help. Can anyone spare a minute to look over this and give
> any points as to what I may be doing wrong or what could be done better?

Hi Justin,


Code review time!  *grin* Here are some suggestions:



It might be good to break out some of the functions into separate
functions.  For example:


>     print '\t', _name_, _ver_, '\t', _date_
>     print '\n\tCreated by:', _auth_
>     print '\tWritten in:', _inf_



could be given a name like printWelcome():

###
def printWelcome():
    print '\t', _name_, _ver_, '\t', _date_
    print '\n\tCreated by:', _auth_
    print '\tWritten in:', _inf_
###




Another part that can be broken out is the grabbing of the XML data:

>     add = '/fcgi-bin/fcgi?cmd=user_xml&email=' + usrmail
>
>     conx = httplib.HTTPConnection(SETI)
>     conx.request("GET", add)
>     response = conx.getresponse()
>
>     if response.status == 500:
>         print response.status, response.reason
>         print """\n\t\tServer Error\n
>         The page or site is temporarily down or disabled.
>         Sorry for the inconvienience.
>
>         Please try again later.\n"""
>         break
>
>     data = response.read()
>
>     if data[-33:] == 'No user with that name was found.':
>         print """\n\t\tNo user with that name was found.
>         Please check the email address and try again.\n"""
>         break
>
>     conx.close()
>
>     xdata = minidom.parseString(data)


This block can be given a name like 'getXmlDomTree()':


###
def getXmlDomTree(usrmail):
    add = '/fcgi-bin/fcgi?cmd=user_xml&email=' + usrmail

    conx = httplib.HTTPConnection(SETI)
    conx.request("GET", add)
    response = conx.getresponse()

    if response.status == 500:
        print response.status, response.reason
        print """\n\t\tServer Error\n
        The page or site is temporarily down or disabled.
        Sorry for the inconvienience.

        Please try again later.\n"""
        break

    data = response.read()

    if data[-33:] == 'No user with that name was found.':
        print """\n\t\tNo user with that name was found.
        Please check the email address and try again.\n"""
        break

    conx.close()

    xdata = minidom.parseString(data)
    return xdata
###


Breaking things down into smaller functions is not really for the
computer, but for us humans.  *grin* It's a little easier to understand a
small function, because we can limit the scope of what we're look at to a
"paragraph" of program text.





The block of code that starts of as:

>                 for node in item.childNodes:
>                     if node.nodeType == node.ELEMENT_NODE:
>                         if node.tagName == 'name':
>                             uname = node.toxml()
>                             ud[1] = uname[6:-7] + ' <' + usrmail + '>'
>                             node.unlink
>                         elif node.tagName == 'numresults':
>                             num = node.toxml()
>                             ud[2] = num[12:-13]
>                             node.unlink
>                         elif node.tagName == 'cputime':
>                             cput = node.toxml()
>                             ud[3] = cput[9:-10]
>                             node.unlink
>                         elif node.tagName == 'avecpu':
>                             avg = node.toxml()
>                             ud[4] = avg[8:-9]
>                             node.unlink
>                         elif node.tagName == 'resultsperday':
>                             rpd = node.toxml()
>                             ud[5] = rpd[15:-16]
>                             node.unlink
>                         elif node.tagName == 'lastresulttime':
>                             lrr = node.toxml()
>                             ud[6] = lrr[16:-17]
>                             node.unlink
>                         elif node.tagName == 'regdate':
>                             regd = node.toxml()
>                             ud[7] = regd[9:-10]
>                             node.unlink
>                         elif node.tagName == 'usertime':
>                             usrt = node.toxml()
>                             ud[8] = usrt[10:-11]
>                             node.unlink
>                         else:
>                             node.unlink

is repetitive.  Long chains of if/elif/elif/elif can often be simplified
if we put some of the program as data, rather than as explicit program
logic.



What does this mean?  Here's another way to get a similar effect:

######
# dispatchTable is a mapping from the tagName we're looking for, to a
# triple (ud_offset, left_slice, right_slice)
dispatchTable = { 'name'         : (1, 6, -7),
                  'numresults'   : (2, 12, -13),
                  'cputime'      : (3, 9, -10),
                  'avecpu'       : (4, 8, -9),
                  'resultsperday': (5, 15, -16),
                  'lastresultime': (6, 16, -17),
                  'regdate'      : (7, 9, -10),
                  'usertime'     : (8, 10, -11),
                  }
for node in item.childNodes:
    if node.nodeType == node.ELEMENT_NODE:
        if node.tagName in dispatchTable:
            offset, left_slice, right_slice = dispatchTable[node.tagName]
            ud[offset] = node.toxml()[left_slice : right_slice]
            node.unlink()
        else:
            node.unlink()
######


We can get rid of the repetitive logic, and keep the thing that is really
changing --- the index into 'ud' that we assign into, and the left and
right indices of our slice.  Not only is the code shorter, but it probably
is a little easier to debug: if we find that one of the numbers is wrong,
we can quickly zoom in, since all the numbers are visually collected
together.



We did lose a little, because now 'name' is being handled the same way as
the other tags.  But we can fix this by treating it as a special case:

###
dispatchTable = {
                  ## name is treated as a special case
                  'numresults'   : (2, 12, -13),
                  'cputime'      : (3, 9, -10),
                  'avecpu'       : (4, 8, -9),
                  'resultsperday': (5, 15, -16),
                  'lastresultime': (6, 16, -17),
                  'regdate'      : (7, 9, -10),
                  'usertime'     : (8, 10, -11),
                  }
for node in item.childNodes:
    if node.nodeType == node.ELEMENT_NODE:
        if node.tagName in dispatchTable:
            offset, left_slice, right_slice = dispatchTable[node.tagName]
            ud[offset] = node.toxml()[left_slice : right_slice]
            node.unlink()
        elif node.tagName == 'name':
            uname = node.toxml()
            ud[1] = uname[6:-7] + ' <' + usrmail + '>'
            node.unlink()
        else:
            node.unlink()
###

So we benefit of treating the general case in a uniform way, but also
leave some room for doing special-case stuff if we really need it.



One other thing, though: we don't have to do the node.toxml()/slicing
technique to get at the text within a node.  That approach is error prone,
since it tries to work with indices that can change at the whim of an XML
DOM implementation.  Instead, take a look at the example in:

    http://www.python.org/doc/lib/dom-example.html

The example uses a 'getText()' function that's more robust.  Look at the
example, and then just cut-and-paste it.  *grin* This alone should cut
down on a lot of the manual string-slicing that you must have calculated
by hand.


With the techniques above, you can probably cut down the program
considerably to something very concise and sweet.


Please feel free to ask questions on any of this.  Good luck to you!




More information about the Tutor mailing list