ElementTree oddities

Mon Sep 15 11:22:29 EDT 2008

I'm trying to extract the text from some XML. I figured this
convenient Python two-liner would do it for me (xml is a string
holding the document):
>>> from xml.etree.ElementTree import *
>>> from cStringIO import StringIO
>>> root = parse(StringIO(xml)).getroot()
>>> ' '.join([n.text for n in root.getiterator() if n.text is not None])

However, it's missing some of the text. For example, the following
XML:
>>> xml = "<highlight><sp />Bar</highlight>"

returns an empty string. It seems the "<sp />" tag is borking it.

Also, for the following XML:
>>> xml = "<highlight><ref>Bar</ref>:</highlight>"

I only get "Bar". It's missing the trailing colon.
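
For reference, here is a self-contained sketch that reproduces both
cases. It is written against a current Python 3 (an assumption on my
part: fromstring() and iter() stand in for the cStringIO/getiterator()
calls above), and the helper names text_only and text_and_tail are just
made up for illustration. The second helper uses itertext(), which, as
far as I can tell, also walks the "tail" attribute where ElementTree
keeps any text that follows an element's closing tag.

```python
import xml.etree.ElementTree as ET


def text_only(xml):
    """Join only the .text of every node, like the two-liner above."""
    root = ET.fromstring(xml)
    return ' '.join(n.text for n in root.iter() if n.text is not None)


def text_and_tail(xml):
    """Join all character data in document order, including each .tail."""
    root = ET.fromstring(xml)
    return ''.join(root.itertext())


for xml in ("<highlight><sp />Bar</highlight>",
            "<highlight><ref>Bar</ref>:</highlight>"):
    root = ET.fromstring(xml)
    # Show where each piece of text actually ends up on the nodes.
    for n in root.iter():
        print(n.tag, repr(n.text), repr(n.tail))
    print(repr(text_only(xml)), repr(text_and_tail(xml)))
```

If that reading is right, "Bar" in the first case and ":" in the second
are stored as the tail of the child element rather than as the text of
<highlight>, which would explain why a .text-only join skips them.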

I'm not that experienced with XML so perhaps I am just missing
something here. Please enlighten me.

Thanks,
Brian