[Tutor] Extracting xml text

Sun Jun 20 10:14:52 CEST 2010

T.R. D., 20.06.2010 08:03:
> I'm trying to parse a list of xml strings and so far it looks like the
> xml.parsers.expat is the way to go but I'm not quite sure how it works.
>
> I'm trying to parse something similar to the following.  I'd like to collect
> all headings and bodies and associate them in a variable (dictionary for
> example). How would I use the expat class to do this?

Well, you *could* use it, but I *would* not recommend it. :)

> <note>
> <to>Tove</to>
> <from>Jani</from>
> <heading>Reminder</heading>
> <body>Don't forget me this weekend!</body>
> </note>
>
> <note>
> <to>Jani</to>
> <from>Tovi</from>
> <heading>Reminder 2</heading>
> <body>Don't forget to bring snacks!</body>
> </note>

Use ElementTree's iterparse:

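     # cElementTree was removed in Python 3.9; plain ElementTree picks up
     # the C accelerator automatically when it is available
     from xml.etree.ElementTree import iterparse

     for _, element in iterparse("the_file.xml"):
         if element.tag == 'note':
             # find text in interesting child elements
             print(element.findtext('heading'), element.findtext('body'))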

             # save some memory by removing the handled content
             element.clear()

iterparse() iterates over parser events, but it also builds an in-memory XML
tree while doing so. That makes it trivial to find things in the stream.
By default it only reports "end" events, so the above code receives an event
whenever a tag closes. It starts working when the closing tag belongs to a
'note' element, i.e. when the complete subtree of that element has been
parsed into memory.
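
To get the dictionary asked for in the question, the same loop can store the
text instead of printing it. The following is only a minimal sketch: the
<notes> root element, the SAMPLE data and the collect_notes() helper are
assumptions made for illustration (an XML document needs a single root, so
the two notes are wrapped in one here), and using the heading as the key
assumes that headings are unique.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data: the two notes from the question, wrapped in an
# assumed <notes> root so that the input is one well-formed document.
SAMPLE = b"""<notes>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note>
<to>Jani</to>
<from>Tovi</from>
<heading>Reminder 2</heading>
<body>Don't forget to bring snacks!</body>
</note>
</notes>"""


def collect_notes(source):
    """Map each note's heading to its body (hypothetical helper)."""
    notes = {}
    # iterparse() accepts a file name or a file object and reports only
    # "end" events by default, i.e. one event per closing tag.
    for _, element in iterparse(source):
        if element.tag == "note":
            # The whole <note> subtree is in memory at this point.
            notes[element.findtext("heading")] = element.findtext("body")
            # Drop the handled content to keep memory use low.
            element.clear()
    return notes


notes = collect_notes(io.BytesIO(SAMPLE))
for heading, body in notes.items():
    print(heading, "->", body)
```

If the notes arrive as separate strings rather than as one file, calling
xml.etree.ElementTree.fromstring() on each string and then using findtext()
on the result works in much the same way.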

Stefan