[XML-SIG] cElementTree.iterparse missing text in some start events

Tue Jan 25 05:58:14 CET 2005

I'm using cElementTree.iterparse to iterate over an XML file. I think
iterparse is a wonderful idea - I've found it to be much more convenient
than SAX for iterative processing. I have come across a problem,
though.

For the majority of my elements, both the start and end events contain
the text of the element (i.e., element.text). For a handful of the
elements, the text is only in the end event (i.e., element.text is None
in the start event but it is not None in the end event). The text is
found without any problem when using cElementTree.parse on the file
instead.
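
To make the symptom concrete, here is a minimal sketch. It uses the
standard-library xml.etree.ElementTree.XMLPullParser from current
Python 3 rather than the cElementTree 0.9.8 build listed below; that is
an assumption, and the two may not behave identically. The helper name
text_seen_per_event and the two-element document are made up for
illustration. The sketch feeds the parser in pieces and records
element.text at the moment each event is read, which shows one way a
start event can arrive before the text has been parsed.

```python
from xml.etree.ElementTree import XMLPullParser


def text_seen_per_event(chunks):
    """Feed chunks one at a time and record (event, tag, text)
    exactly as they look when each event is read.

    Hypothetical helper for illustration only.
    """
    parser = XMLPullParser(events=("start", "end"))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)
        # Events become available only for what has been fed so far.
        for event, element in parser.read_events():
            seen.append((event, element.tag, element.text))
    parser.close()
    return seen


# Split between the start tag and its text: the start event is read
# before the parser has seen any character data.
split_seen = text_seen_per_event(["<gpx><ele>", "297.257582</ele></gpx>"])

# Same document fed in one piece: the element is already complete by
# the time the start event is read.
whole_seen = text_seen_per_event(["<gpx><ele>297.257582</ele></gpx>"])

for row in split_seen:
    print("split:", row)
for row in whole_seen:
    print("whole:", row)
```

When the split falls between the start tag and the text, the start
event reports None; fed as one piece, the same start event already
shows the text. Whether chunked reading is what happens with test.xml
is not confirmed here.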

A small test to reproduce this behavior is at the end of this note and
an 80KB sample XML file is at
http://www.averdevelopment.com/python/test.xml. The test file is
whittled down from a much larger file which had the problem with several
more elements (but only a very small percentage of the total). I
couldn't seem to delete any elements before the element in question
without changing the behavior.

Am I misunderstanding something or is this perhaps a bug?

I'm using:
http://effbot.org/downloads/cElementTree-0.9.8-20050123.win32-py2.3.exe
http://effbot.org/downloads/elementtree-1.2.4-20041228.win32.exe
http://python.org/ftp/python/2.3.4/Python-2.3.4.exe
Windows XP SP2

Thanks,
Jimmy

####################################################

import sets
from cElementTree import dump, iterparse, parse

values = dict(start=sets.Set(), end=sets.Set())

i = 0
for event, element in iterparse('test.xml', ('start', 'end')):
    if element.tag.endswith('}ele') and element.text:
        values[event].add(element.text)
    if element.tag.endswith('}ele') and element.text is None:
        print i, event + ' '
        dump(element)
    if element.text == '297.257582':
        print i, event + ' '
        dump(element)
    i += 1

print 'In start but not end:', values['start'] - values['end']
print 'In end but not start:', values['end'] - values['start']
print

# Finding the same text with cElementTree.parse is no problem
gpx = parse('test.xml').getroot()
trk = gpx.findall('{http://www.topografix.com/GPX/1/1}trk')[-1]
trkseg = trk.findall('{http://www.topografix.com/GPX/1/1}trkseg')[-1]
trkpt = trkseg.findall('{http://www.topografix.com/GPX/1/1}trkpt')[-2]
ele = trkpt.findall('{http://www.topografix.com/GPX/1/1}ele')[0]
print ele.text

####################################################

Output:

3622 start 
<ns0:ele xmlns:ns0="http://www.topografix.com/GPX/1/1" />
3623 end 
<ns0:ele xmlns:ns0="http://www.topografix.com/GPX/1/1">297.257582</ns0:ele>

In start but not end: Set([])
In end but not start: Set(['297.257582'])

297.257582
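
If the text is only needed once it is complete, one option is to listen
for end events only. The sketch below assumes the Python 3
standard-library iterparse, not the cElementTree build used above. It
runs on a small in-memory GPX-like document invented for illustration
(it is not test.xml), and ele_texts is a hypothetical helper name.

```python
import io
from xml.etree.ElementTree import iterparse

GPX_NS = "{http://www.topografix.com/GPX/1/1}"


def ele_texts(source):
    """Collect the text of every <ele> element, reading it only on
    'end' events, when the element has been fully parsed.

    Hypothetical helper for illustration only.
    """
    texts = []
    for event, element in iterparse(source, events=("end",)):
        if element.tag == GPX_NS + "ele" and element.text:
            texts.append(element.text)
    return texts


# Made-up sample document; the structure mirrors the tags used in the
# test script (trk / trkseg / trkpt / ele).
sample = io.BytesIO(
    b'<gpx xmlns="http://www.topografix.com/GPX/1/1">'
    b'<trk><trkseg><trkpt lat="0" lon="0">'
    b'<ele>297.257582</ele>'
    b'</trkpt></trkseg></trk>'
    b'</gpx>'
)

collected = ele_texts(sample)
print(collected)
```

Reading on end events costs nothing extra here, since the start event
was only being used to look at element.text.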