[XML-SIG] Re: cElementTree.iterparse missing text in some start events
Fredrik Lundh
fredrik at pythonware.com
Tue Jan 25 09:09:16 CET 2005
Jimmy Retzlaff wrote:
> I'm using cElementTree.iterparse to iterate over an XML file. I think
> iterparse is a wonderful idea - I've found it to be much more convenient
> than SAX for iterative processing. I have come across a problem
> though...
>
> For the majority of my elements, both the start and end events contain
> the text of the element (i.e., element.text). For a handful of the
> elements, the text is only in the end event (i.e., element.text is None
> in the start event but it is not None in the end event). The text is
> found without any problem when using cElementTree.parse on the file
> instead.
> Am I misunderstanding something or is this perhaps a bug?
it needs more documentation ;-)
here's what the comment in the CHANGES document says:
The elem object is the current element; for "start" events,
the element itself has been created (including attributes), but its
contents may not be complete; for "end" events, all child elements
have been processed as well. You can use "start" tags to count
elements, check attributes, and check if certain tags are present
in a tree. For all other purposes, use "end" handlers instead.
in that text, "may not" really means "may or may not". that is, the contents
may be complete, but that's nothing you can or should rely on.
the reason for this is that events don't fire in perfect lockstep with the build
process; in the current version, the parser may be up to 16k of input further
ahead. this means that when you get a "start" event, the parser has often
already processed everything inside the element (especially if it's small
enough), but you cannot rely on that.
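
a minimal sketch of the safe pattern, assuming the iterparse function in today's standard library (xml.etree.ElementTree) behaves like the cElementTree one discussed here. the sample document and the collect_text name are made up for illustration:

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample document, not taken from the original report.
XML = b"<root><item id='1'>alpha</item><item id='2'>beta</item></root>"

def collect_text(source):
    """Return (tag, text) pairs, reading elem.text only on "end" events."""
    result = []
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            # elem.text may or may not be filled in yet; do not read it here.
            continue
        # On "end" the element's own text and its children are complete.
        result.append((elem.tag, elem.text))
    return result

pairs = collect_text(io.BytesIO(XML))
```

on a document this small the text would probably also show up in the start events, which is exactly why the problem is easy to miss in testing.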
or in other words, for a start event, the following attributes are valid:
    elem.tag
    elem.attrib
    tags and attributes for parent elements (use a stack if you
        need to track them)
    (not elem.text)
    (not elem.tail)
    (not elem[:])
    you may modify the tag and attrib attributes
    you may stop parsing
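
the "use a stack" hint could look roughly like this. it is only a sketch: it uses xml.etree.ElementTree.iterparse from the current standard library as a stand-in for cElementTree, and the sample document, the "name" attribute and the element_paths helper are invented for the example:

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample document.
XML = b"<doc><section name='a'><para/></section><section name='b'/></doc>"

def element_paths(source):
    """Return (path, name attribute) for every element, using only
    what is valid in a "start" event: elem.tag and elem.attrib."""
    stack = []   # tags of the currently open elements, outermost first
    paths = []
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            paths.append(("/".join(stack), elem.attrib.get("name")))
        else:
            # "end" event: this element is closed, so leave its level.
            stack.pop()
    return paths

paths = element_paths(io.BytesIO(XML))
```

nothing here touches elem.text, elem.tail or the children, so it does not depend on how far ahead the parser happens to be.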
and for an end event, the following applies:
    elem.tag
    elem.attrib
    elem.text
    elem[:] (i.e. the children)
    complete contents for all children (including the tail)
    (not elem.tail) (but all child tails)
    you may modify all attributes, except elem.tail
    you may reorder/update children
    you may remove children (e.g. calling elem.clear() to mark that
        you're done with this level)
    you may stop parsing
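
as an illustration of the end-event rules, here is a sketch that reads each element's text when it is complete and then calls elem.clear() to release it. again this assumes the standard library's xml.etree.ElementTree.iterparse rather than cElementTree itself; the sample document and the entry_texts name are hypothetical:

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample document.
XML = b"<log><entry>one</entry><entry>two</entry><entry>three</entry></log>"

def entry_texts(source):
    """Collect the text of each <entry> on its "end" event, then clear it."""
    texts = []
    for event, elem in iterparse(source, events=("end",)):
        if elem.tag == "entry":
            texts.append(elem.text)  # complete at this point
            elem.clear()             # done with this level; drop its contents
    return texts

texts = entry_texts(io.BytesIO(XML))
```

note that the cleared elements are still (empty) children of their parent, so for very large files you may want to clear or prune the parent as well.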
clearer? I think I need to draw a couple of diagrams...
</F>