lxml/ElementTree and .tail
Chas Emerick
cemerick at snowtide.com
Wed Nov 15 14:33:33 EST 2006
I looked around for an ElementTree-specific mailing list, but found
none -- my apologies if this is too broad a forum for this question.
I've been using the lxml variant of the ElementTree API, which I
understand works in much the same way (with some significant
additions). In particular, it shares the use of a .tail attribute.
I ran headlong into this aspect of the API while doing some DOM
manipulations, and it's got me pretty confused.
Example:
>>> from lxml import etree as ET
>>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
>>> b = frag.xpath('//b')[0]
>>> b
<Element b at 71cbe8>
>>> b.text
'inside'
>>> b.tail
'tail'
>>> frag.remove(b)
>>> ET.tostring(frag)
'<a>head</a>'
As you can see, the .tail text is removed as part of the <b> element
-- but it IS NOT part of the <b> element. I understand the use of
the .tail attribute given the desire to simplify the API by avoiding
pure text nodes, but it seems entirely inappropriate for the tail
text to disappear into the ether when what is technically a sibling
node is removed.
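To make the data model concrete, here is a minimal sketch of where each run of text ends up. I'm assuming the standard-library xml.etree.ElementTree here so that it runs without lxml installed; as far as I can tell lxml stores the text the same way. The names `children` and `layout` are just for illustration.

```python
import xml.etree.ElementTree as ET

frag = ET.XML('<a>head<b>inside</b>tail</a>')
b = frag[0]

# <a> has exactly one child element; the text runs are not nodes of their own.
children = [child.tag for child in frag]

# Every run of character data hangs off an element, either as .text
# (before the first child) or as .tail (after the element's end tag).
layout = {
    'a.text': frag.text,   # 'head'
    'b.text': b.text,      # 'inside'
    'b.tail': b.tail,      # 'tail'
    'a.tail': frag.tail,   # None: nothing follows the root element
}
print(children, layout)
```

So "tail" sits between </b> and </a> in the document, but in the tree it is stored on the <b> object, which is why it leaves with <b>.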
Performing the same operations with the Java DOM API (the Crimson
parser, as it turns out) yields what I would expect. Here I'm using
JPype to access a v1.4.2 JVM from Python, which makes things
somewhat less painful:
>>> from jpype import *
>>> startJVM(getDefaultJVMPath())
>>> builder = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder()
>>> xml = java.io.ByteArrayInputStream(java.lang.String('<a>head<b>inside</b>tail</a>').getBytes())
>>> doc = builder.parse(xml)
>>> a = doc.documentElement
>>> a.toString()
u'<a>head<b>inside</b>tail</a>'
>>> b = a.getElementsByTagName('b').item(0)
>>> a.removeChild(b)
>>> a.toString()
u'<a>headtail</a>'
(Sorry for the Java comparison, but that's where I first cut my teeth
on XML, and that's where my expectations were formed.)
That's a pretty significant mismatch in functionality. I certainly
understand the motivations of Mr. Lundh to make the ET API as
pythonic as possible, but ET's behaviour in this specific context is
flatly wrong as far as I can see. I would have expected that a
removal operation would have appended <b>'s tail text to the text of
<a> (or perhaps to the tail text of <b>'s closest preceding sibling)
-- something that I think I'm going to have to do in order to
continue using lxml / ElementTree.
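A minimal sketch of that workaround, under a couple of assumptions: `remove_preserving_tail` is a hypothetical helper name of my own, not part of ElementTree or lxml, and I'm using the standard-library xml.etree.ElementTree so the snippet runs without lxml (the same attribute handling should apply to lxml elements). It moves the tail onto the preceding sibling's tail, or onto the parent's text if there is no preceding sibling, before removing the element.

```python
import xml.etree.ElementTree as ET

def remove_preserving_tail(parent, child):
    """Remove child from parent, but keep child's tail text in the tree.

    Hypothetical helper, not an ElementTree or lxml API.
    """
    tail = child.tail
    if tail:
        index = list(parent).index(child)
        if index == 0:
            # No preceding sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or '') + tail
        else:
            # Otherwise it is appended to the preceding sibling's tail.
            previous = parent[index - 1]
            previous.tail = (previous.tail or '') + tail
        # Clear the tail on the detached element so that re-inserting it
        # later does not duplicate the text.
        child.tail = None
    parent.remove(child)

frag = ET.XML('<a>head<b>inside</b>tail</a>')
remove_preserving_tail(frag, frag.find('b'))
print(ET.tostring(frag, encoding='unicode'))  # <a>headtail</a>
```

That reproduces the result the Java DOM gives for the same fragment.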
I ran this issue past a few people I know who've worked with and
written about ElementTree, and their response to this apparent
divergence between the ET DOM API and "standard" DOM APIs was
roughly: "that's just the way it is".
Comments, thoughts?
Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction
cemerick at snowtide.com
http://snowtide.com | +1 413.519.6365