lxml/ElementTree and .tail

Fredrik Lundh fredrik at pythonware.com
Sat Nov 18 17:29:08 CET 2006

Chas Emerick wrote:

>> and keep patting ourselves
>> on the back, while the rest of the world is busy routing around
>> us, switching to well-understood XML subsets or other serialization
>> formats, simpler and more flexible data models, simpler API:s, and
>> more robust code.  and Python ;-)
> That's flatly unrealistic.  If you'll remember, I'm not one of "those  
> people" that are specification-driven -- I hadn't even *heard* of  
> Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really 
think you got the point of it either.  Which is a bit strange, because 
it sounded like you *were* working on extracting information from messy 
documents, so the "it's about the data, dammit" way of thinking 
shouldn't be news to you.

And the routing around is not unrealistic, it's a *fact*; JSON and
POX are killing the full XML/Schema/SOAP stack for communication, XHTML 
is pretty much dead as a wire format, people are apologizing in public 
for their use of SOAP, AJAX is quickly turning into AJAJ, few people 
care about the more obscure details of the XML 1.0 standard (when did 
you last see a conditional section? or even a DTD?), dealing with huge 
XML data sets is still extremely hard compared to just uploading the 
darn thing to a database and doing the crunching in SQL, and nobody uses 
XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage, every 
single time.

 > overwhelming majority of the developers out there care for nothing
 > but the serialization, simply because that's how one plays nicely
 > with others.

The problem is if you only stare at the serialization, your code *won't* 
play nicely with others.  At the serialization level, it's easy to think 
that CDATA sections are different from other text, that character 
references are different from ordinary characters, that you should 
somehow be able to distinguish between <tag></tag> and <tag/>, that 
namespace prefixes are more important than the namespace URI, that an
&nbsp; in an XHTML-style stream is different from a U+00A0 character in
memory, and so on.  In my experience, serialization-only thinking (at 
the receiving end) is the single most common cause of interoperability
problems when it comes to general XML interchange.

But when you focus on the data model, and treat the serialization as an 
implementation detail, to be addressed by a library written by someone 
who's actually read the specifications a few more times than you have, 
all those problems tend to just go away.  Things just work.
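
A minimal sketch of that, using the standard library's
xml.etree.ElementTree (the sample documents and the example.com
namespace URI are made up for illustration).  Once the library has
parsed the data, the serialization-level distinctions listed above are
no longer visible:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: same data, three different serializations.
variants = [
    "<doc>a &lt; b</doc>",           # predefined entity
    "<doc>a &#60; b</doc>",          # numeric character reference
    "<doc><![CDATA[a < b]]></doc>",  # CDATA section
]
texts = [ET.fromstring(v).text for v in variants]
print(texts)  # every entry is the same string, 'a < b'

# <tag></tag> and <tag/> produce equivalent elements.
empty_long = ET.fromstring("<tag></tag>")
empty_short = ET.fromstring("<tag/>")
same_empty = (empty_long.tag == empty_short.tag
              and empty_long.text == empty_short.text
              and len(empty_long) == len(empty_short))

# Different prefixes bound to the same (made-up) namespace URI
# give the same qualified tag name.
ns_x = ET.fromstring('<x:item xmlns:x="http://example.com/ns"/>')
ns_y = ET.fromstring('<y:item xmlns:y="http://example.com/ns"/>')
same_tag = ns_x.tag == ns_y.tag

# A character reference for U+00A0 is just that character in memory.
nbsp_text = ET.fromstring("<p>&#xA0;</p>").text
```

Code that compares `texts`, `ns_x.tag`, or `nbsp_text` never needs to
know which spelling the sender happened to use.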

And in practice, of course, most software engineers understand this, and 
care about this.  After all, good software engineering is about 
abstractions and decoupling and designing things so you can focus on one 
part of the problem at a time.  And about making your customer happy, 
and having fun while doing that.  Not staying up all night to look for 
an obscure interoperability problem that you finally discover is caused 
by someone using a CDATA section where you expected a character 
reference, in 0.1% of all production records, but in none of the files 
in your test data set.

(By the way, did ET fail to *read* your XML documents?  I thought your 
complaint was that it didn't put the things it read in a place where you 
expected them to be, and that you didn't have time to learn how to deal 
with that because you had more important things to do, at the time?)
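
For reference, a small sketch of where ET puts mixed-content text,
assuming the placement in question is the .tail attribute named in the
subject line (the sample paragraph is made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content fragment.
para = ET.fromstring("<p>Hello <b>big</b> world</p>")
bold = para[0]

lead = para.text    # text before the first child: 'Hello '
inner = bold.text   # text inside <b>: 'big'
trail = bold.tail   # text after </b>, stored on the <b> element: ' world'

# itertext() walks text and tail in document order,
# so the full string is easy to recover.
flat = "".join(para.itertext())
```

The text following an element lives on that element, not on its
parent; nothing is lost, it is just stored in a different place than a
DOM-style API would use.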

