cElementTree clear semantics
Fredrik Lundh
fredrik at pythonware.com
Sun Sep 25 15:18:38 EDT 2005
Igor V. Rafienko wrote:
> Finally, I thought about keeping track of when to clear and when not
> to by subscribing to start and end elements (so that I would collect
> the entire <schnappi>-subtree in memory and only then release it):
>
>     from cElementTree import iterparse
>     clear_flag = True
>     for event, elem in iterparse("data.xml", ("start", "end")):
>         if event == "start" and elem.tag == "schnappi":
>             # start collecting elements
>             clear_flag = False
>         if event == "end" and elem.tag == "schnappi":
>             clear_flag = True
>             # do something with elem
>         # unless we are collecting elements, clear()
>         if clear_flag:
>             elem.clear()
>
> This gave me the desired behaviour, but:
>
> * It looks *very* ugly
> * It's twice as slow as version which sees 'end'-events only.
>
> Now, there *has* to be a better way. What am I missing?
the iterparse/clear approach works best if your XML file has a
record-like structure. if you have toplevel records with lots of
schnappi records in them, iterate over the records and use find
(etc) to locate the subrecords you're interested in:
    for event, elem in iterparse("data.xml"):
        if elem.tag == "record":
            # deal with schnappi subrecords
            for schnappi in elem.findall(".//schnappi"):
                process(schnappi)
            elem.clear()
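for reference, here is a self-contained sketch of the record-oriented loop.
it assumes a modern Python, where the same iterparse API is available from
xml.etree.ElementTree. the sample document, the collect_schnappi helper and
the "name" attribute are made up for illustration:

```python
import io
from xml.etree.ElementTree import iterparse

# hypothetical sample data: toplevel records with schnappi subrecords
SAMPLE = b"""<data>
  <record id="1"><schnappi name="a"/><other/><schnappi name="b"/></record>
  <record id="2"><group><schnappi name="c"/></group></record>
</data>"""

def collect_schnappi(source):
    """Return the name attribute of every schnappi, one record at a time."""
    names = []
    # default events: "end" only, so each record is complete when we see it
    for event, elem in iterparse(source):
        if elem.tag == "record":
            for schnappi in elem.findall(".//schnappi"):
                names.append(schnappi.get("name"))
            # drop the record's children, text and attributes
            elem.clear()
    return names

names = collect_schnappi(io.BytesIO(SAMPLE))
print(names)
```

only one record's subtree is populated at any time, so memory use should
depend on the largest record rather than on the size of the file.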
the collect flag approach isn't that bad ("twice as slow" doesn't
really say much: "raw" cElementTree is extremely fast compared
to the Python interpreter, so everything you end up doing in
Python will slow things down quite a bit).
to make your application code look a bit less convoluted, put the
logic in a generator function:
    # in library
    def process(filename, annoying_animal):
        clear = True
        start = "start"; end = "end"
        for event, elem in iterparse(filename, (start, end)):
            if elem.tag == annoying_animal:
                if event is start:
                    clear = False
                else:
                    yield elem
                    clear = True
            if clear:
                elem.clear()
    # in application
    for subelem in process(filename, "schnappi"):
        # do something with subelem
(I've reorganized the code a bit to cut down on the operations.
also note the "is" trick; iterparse returns the event strings you
pass in, so comparing on object identities is safe)
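a runnable sketch of the same generator, assuming a modern Python with
iterparse imported from xml.etree.ElementTree. the sample document and the
subtrees name are hypothetical, and the sketch compares events with "=="
instead of "is", as the conservative choice when it is unclear whether a
given implementation hands back the very same string objects:

```python
import io
from xml.etree.ElementTree import iterparse

# hypothetical sample data
SAMPLE = b"""<zoo>
  <cage><schnappi><tooth>1</tooth><tooth>2</tooth></schnappi></cage>
  <pond><frog/><schnappi><tooth>3</tooth></schnappi></pond>
</zoo>"""

def subtrees(source, tag):
    """Yield each complete <tag> element; clear everything outside of one."""
    clear = True
    for event, elem in iterparse(source, ("start", "end")):
        if elem.tag == tag:
            if event == "start":
                clear = False   # start collecting
            else:
                yield elem      # subtree is complete
                clear = True    # released by the clear() call below
        if clear:
            elem.clear()

teeth = []
for subelem in subtrees(io.BytesIO(SAMPLE), "schnappi"):
    # use the subtree here; it is cleared once the generator resumes
    teeth.append([t.text for t in subelem.findall("tooth")])
print(teeth)
```

note that the yielded element is cleared as soon as the loop asks for the
next one, so anything you need from it has to be copied out inside the
loop body.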
an alternative is to use the lower-level XMLParser class (which
is similar to SAX, but faster), but that will most likely result in
more and trickier Python code...
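to give an idea of what that involves, here is a minimal sketch using the
XMLParser class from a modern xml.etree.ElementTree with a custom target
object. the SchnappiText class and the sample input are hypothetical; the
target collects the text inside each matching element and has to do its
own bookkeeping to know where it is in the document:

```python
from xml.etree.ElementTree import XMLParser

class SchnappiText:
    """Hypothetical parser target: gathers the text inside each <tag>."""

    def __init__(self, tag):
        self.tag = tag
        self.depth = 0       # > 0 while inside a matching element
        self.chunks = []     # text pieces of the current match
        self.results = []    # one string per completed match

    def start(self, tag, attrib):
        # enter a match, or go one level deeper inside one
        if tag == self.tag or self.depth:
            self.depth += 1

    def end(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

    def data(self, text):
        # character data may arrive in several pieces
        if self.depth:
            self.chunks.append(text)

    def close(self):
        # the parser's close() returns whatever this returns
        return self.results

parser = XMLParser(target=SchnappiText("schnappi"))
parser.feed("<zoo><schnappi>snap <b>snap</b></schnappi>"
            "<frog>ribbit</frog><schnappi>!</schnappi></zoo>")
texts = parser.close()
print(texts)
```

no element tree is built at all here, which keeps memory use low, but all
the state tracking that iterparse does for you ends up in your own code.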
</F>