lxml/ElementTree and .tail

Wed Nov 15 14:33:33 EST 2006

I looked around for an ElementTree-specific mailing list, but found  
none -- my apologies if this is too broad a forum for this question.

I've been using the lxml variant of the ElementTree API, which I  
understand works in much the same way (with some significant  
additions).  In particular, it shares the use of a .tail attribute.   
I ran headlong into this aspect of the API while doing some DOM  
manipulations, and it's got me pretty confused.

Example:

 >>> from lxml import etree as ET
 >>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
 >>> b = frag.xpath('//b')[0]
 >>> b
<Element b at 71cbe8>
 >>> b.text
'inside'
 >>> b.tail
'tail'
 >>> frag.remove(b)
 >>> ET.tostring(frag)
'<a>head</a>'

As you can see, the .tail text is removed as part of the <b> element  
-- but it IS NOT part of the <b> element.  I understand the use of  
the .tail attribute given the desire to simplify the API by avoiding  
pure text nodes, but it seems entirely inappropriate for the tail  
text to disappear into the ether when what is technically a sibling  
node is removed.

Performing the same operations with the Java DOM API (Crimson, as it
turns out in this case) yields what I would expect (here I'm using
JPype to access a v1.4.2 JVM from Python -- which makes things
somewhat less painful):

 >>> from jpype import *
 >>> startJVM(getDefaultJVMPath())
 >>> builder = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder()
 >>> xml = java.io.ByteArrayInputStream(java.lang.String('<a>head<b>inside</b>tail</a>').getBytes())
 >>> doc = builder.parse(xml)
 >>> a = doc.documentElement
 >>> a.toString()
u'<a>head<b>inside</b>tail</a>'
 >>> b = a.getElementsByTagName('b').item(0)
 >>> a.removeChild(b)
 >>> a.toString()
u'<a>headtail</a>'

(Sorry for the Java comparison, but that's where I first cut my teeth  
on XML, and that's where my expectations were formed.)
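
For anyone without a JVM handy, the same text-node behaviour can be seen with the standard library's xml.dom.minidom. This is only a rough sketch for comparison, not part of the session above; the variable names are arbitrary.

```python
from xml.dom import minidom

# Parse the same fragment with a W3C-style DOM, where text lives in
# its own nodes rather than in .text/.tail attributes.
doc = minidom.parseString('<a>head<b>inside</b>tail</a>')
a = doc.documentElement
b = a.getElementsByTagName('b')[0]

# Removing <b> leaves its sibling text node "tail" in place under <a>.
a.removeChild(b)
result = a.toxml()
print(result)  # <a>headtail</a>
```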

That's a pretty significant mismatch in functionality.  I certainly  
understand the motivations of Mr. Lundh to make the ET API as  
pythonic as possible, but ET's behaviour in this specific context is  
flatly wrong as far as I can see.  I would have expected that a  
removal operation would have appended <b>'s tail text to the text of  
<a> (or perhaps to the tail text of <b>'s closest preceding sibling)  
-- something that I think I'm going to have to do in order to  
continue using lxml / ElementTree.
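
A minimal sketch of that workaround follows. The helper name remove_preserving_tail is made up for illustration (it is not part of lxml or ElementTree), and the example uses the standard library's xml.etree.ElementTree on the assumption that the same code works unchanged against lxml.etree.

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent, child):
    """Remove child from parent, keeping child's tail text in the tree.

    Hypothetical helper: the tail is appended to the previous sibling's
    tail, or to the parent's text if child is the first child.
    """
    tail = child.tail
    if tail:
        siblings = list(parent)
        index = siblings.index(child)  # raises ValueError if not a child
        if index == 0:
            parent.text = (parent.text or '') + tail
        else:
            previous = siblings[index - 1]
            previous.tail = (previous.tail or '') + tail
    parent.remove(child)
    # Clear the tail so the text is not duplicated if child is reinserted.
    child.tail = None


frag = ET.XML('<a>head<b>inside</b>tail</a>')
remove_preserving_tail(frag, frag.find('b'))
print(frag.text)  # headtail
```

Whether the orphaned text belongs on the parent's text or on the preceding sibling's tail depends only on where <b> sat among its siblings, so the helper handles both cases.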

I ran this issue past a few people I know who've worked with and  
written about ElementTree, and their response to this apparent  
divergence between the ET DOM API and "standard" DOM APIs was  
roughly: "that's just the way it is".

Comments, thoughts?

Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction

cemerick at snowtide.com
http://snowtide.com | +1 413.519.6365