[XML-SIG] Re: cElementTree.iterparse missing text in some start events

Fredrik Lundh fredrik at pythonware.com
Tue Jan 25 09:09:16 CET 2005

Jimmy Retzlaff wrote:

> I'm using cElementTree.iterparse to iterate over an XML file. I think
> iterparse is a wonderful idea - I've found it to be much more convenient
> than SAX for iterative processing. I have come across a problem
> though...
> For the majority of my elements, both the start and end events contain
> the text of the element (i.e., element.text). For a handful of the
> elements, the text is only in the end event (i.e., element.text is None
> in the start event but it is not None in the end event). The text is
> found without any problem when using cElementTree.parse on the file
> instead.

> Am I misunderstanding something or is this perhaps a bug?

it needs more documentation ;-)

here's what the comment in the CHANGES document says:

    The elem object is the current element; for "start" events,
    the element itself has been created (including attributes), but its
    contents may not be complete; for "end" events, all child elements
    have been processed as well.  You can use "start" tags to count
    elements, check attributes, and check if certain tags are present
    in a tree.  For all other purposes, use "end" handlers instead.

in that text, "may not" really means "may or may not".  that is, the contents
may be complete, but that's nothing you can or should rely on.

the reason for this is that events don't fire in perfect lockstep with the build
process; in the current version, the parser may be up to 16k further ahead.
this means that when you get a "start" event, the parser has often processed
everything inside the element (especially if it's small enough), but you cannot
rely on that.
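
a minimal sketch of that rule, assuming the standard library's
xml.etree.ElementTree (which took over from cElementTree) and a made-up
document; XML and collect_text are hypothetical names used only for
illustration.  the text is read in the "end" branch only:

```python
import io
import xml.etree.ElementTree as ET

# hypothetical sample document, not taken from the original report
XML = b"<root><item id='1'>hello</item><item id='2'>world</item></root>"

def collect_text(source):
    """Return (id, text) pairs, reading elem.text only on "end" events."""
    result = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # only the tag and the attributes are reliable here;
            # elem.text may or may not have been filled in yet
            continue
        if elem.tag == "item":
            # on "end" the element's contents are complete
            result.append((elem.attrib["id"], elem.text))
    return result

pairs = collect_text(io.BytesIO(XML))
```

with a small in-memory document like this one the text would usually be
there on "start" as well, which is exactly why the problem is easy to miss.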

or in other words, for a start event, the following applies:

    elem.tag and elem.attrib are valid
    tags and attributes for parent elements (use a stack if you
        need to track them)
    (not elem.text)
    (not elem.tail)
    (not elem[:])
    you may modify the tag and attrib attributes
    you may stop parsing
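
the stack idea might look something like this; a sketch only, assuming
xml.etree.ElementTree and a made-up document (XML and paths_on_start are
hypothetical names).  it touches nothing but elem.tag on "start":

```python
import io
import xml.etree.ElementTree as ET

# hypothetical sample document
XML = b"<doc><section name='a'><para/><para/></section><para/></doc>"

def paths_on_start(source):
    """Return the tag path of every element, tracked with a stack."""
    stack = []   # tags of the elements that are currently open
    paths = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # the tag is valid on "start"; push it and record the path
            stack.append(elem.tag)
            paths.append("/".join(stack))
        else:
            # the element is closed; it is no longer anyone's parent
            stack.pop()
    return paths

paths = paths_on_start(io.BytesIO(XML))
```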

and for an end event, the following applies:

    elem.tag, elem.attrib and elem.text
    elem[:] (i.e. the children)
    complete contents for all children (including each child's tail)
    (not elem.tail) (but all child tails)
    you may modify all attributes, except elem.tail
    you may reorder/update children
    you may remove children (e.g. calling elem.clear() to mark that
        you're done with this level)
    you may stop parsing
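
and a sketch of the "end" side, again assuming xml.etree.ElementTree and
a made-up document (XML and messages are hypothetical names): read the
children once the element is complete, then call elem.clear() to drop them:

```python
import io
import xml.etree.ElementTree as ET

# hypothetical sample document
XML = b"<log><entry><msg>one</msg></entry><entry><msg>two</msg></entry></log>"

def messages(source):
    """Collect the msg text of each entry, clearing entries when done."""
    found = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            # all children of this entry are complete on "end"
            found.append(elem.findtext("msg"))
            # done with this level; release the children
            elem.clear()
    return found

msgs = messages(io.BytesIO(XML))
```

the cleared elements stay attached to the root as empty shells, so for very
large files you may want to get rid of those as well.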

clearer?  I think I need to draw a couple of diagrams...

