lxml/ElementTree and .tail
fredrik at pythonware.com
Sat Nov 18 17:29:08 CET 2006
Chas Emerick wrote:
>> and keep patting ourselves on the back, while the rest of the
>> world is busy routing around
>> us, switching to well-understood XML subsets or other serialization
>> formats, simpler and more flexible data models, simpler API:s, and
>> more robust code. and Python ;-)
> That's flatly unrealistic. If you'll remember, I'm not one of "those
> people" that are specification-driven -- I hadn't even *heard* of
> Infoset until earlier this week!
The rant wasn't directed at you or anyone special, but I don't really
think you got the point of it either. Which is a bit strange, because
it sounded like you *were* working on extracting information from messy
documents, so the "it's about the data, dammit" way of thinking
shouldn't be news to you.
And the routing around is not unrealistic, it's a *fact*; JSON and
POX are killing the full XML/Schema/SOAP stack for communication, XHTML
is pretty much dead as a wire format, people are apologizing in public
for their use of SOAP, AJAX is quickly turning into AJAJ, few people
care about the more obscure details of the XML 1.0 standard (when did
you last see a conditional section? or even a DTD?), dealing with huge
XML data sets is still extremely hard compared to just uploading the
darn thing to a database and doing the crunching in SQL, and nobody uses
XML 1.1 for anything.
Practicality beats purity, and the Internet routes around damage, every
> overwhelming majority of the developers out there care for nothing
> but the serialization, simply because that's how one plays nicely
> with others.
The problem is if you only stare at the serialization, your code *won't*
play nicely with others. At the serialization level, it's easy to think
that CDATA sections are different from other text, that character
references are different from ordinary characters, that you should
somehow be able to distinguish between <tag></tag> and <tag/>, that
namespace prefixes are more important than the namespace URI, that an
&nbsp; in an XHTML-style stream is different from a U+00A0 character in
memory, and so on. In my experience, serialization-only thinking (at
the receiving end) is the single most common cause for interoperability
problems when it comes to general XML interchange.
But when you focus on the data model, and treat the serialization as an
implementation detail, to be addressed by a library written by someone
who's actually read the specifications a few more times than you have,
all those problems tend to just go away. Things just work.
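A minimal sketch of the equivalences listed above, assuming the
xml.etree.ElementTree module from the Python standard library. The
sample documents and the http://example.com/ns namespace URI are made
up for illustration. Since a bare XML parser has no DTD to define
&nbsp;, the sketch uses the equivalent character reference &#160;
instead.

```python
import xml.etree.ElementTree as ET

# Three serializations of the same text content.
variants = [
    "<doc>a &lt; b</doc>",            # predefined entity
    "<doc>a &#60; b</doc>",           # numeric character reference
    "<doc><![CDATA[a < b]]></doc>",   # CDATA section
]
texts = [ET.fromstring(v).text for v in variants]

# Two spellings of an empty element.
empty_a = ET.fromstring("<tag></tag>")
empty_b = ET.fromstring("<tag/>")

# Different prefixes bound to the same (hypothetical) namespace URI.
ns_a = ET.fromstring('<x:item xmlns:x="http://example.com/ns"/>')
ns_b = ET.fromstring('<y:item xmlns:y="http://example.com/ns"/>')

# A character reference for U+00A0 (what &nbsp; stands for in XHTML).
nbsp = ET.fromstring("<p>a&#160;b</p>").text

print(texts)                 # the three strings are identical
print(ns_a.tag == ns_b.tag)  # prefixes are gone, only the URI remains
print(repr(nbsp))
```

In the tree, all three text variants are the same string, the two
empty elements are indistinguishable, and both namespaced elements
carry the same URI-qualified tag. Code written against the tree never
has to know which spelling the sender picked.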
And in practice, of course, most software engineers understand this, and
care about this. After all, good software engineering is about
abstractions and decoupling and designing things so you can focus on one
part of the problem at a time. And about making your customer happy,
and having fun while doing that. Not staying up all night to look for
an obscure interoperability problem that you finally discover is caused
by someone using a CDATA section where you expected a character
reference, in 0.1% of all production records, but in none of the files
in your test data set.
(By the way, did ET fail to *read* your XML documents? I thought your
complaint was that it didn't put the things it read in a place where you
expected them to be, and that you didn't have time to learn how to deal
with that because you had more important things to do, at the time?)
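For reference, a small sketch of where ET puts text in mixed content,
assuming the standard library's xml.etree.ElementTree. The sample
document and the flatten() helper are hypothetical, made up for this
illustration only.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>big</b> world</p>")
b = p.find("b")

# Text before the first child lives on the parent's .text;
# text that follows a child element lives on that child's .tail.
print(repr(p.text))   # text before <b>
print(repr(b.text))   # text inside <b>
print(repr(b.tail))   # text after </b>, still inside <p>

def flatten(elem):
    """Hypothetical helper: join all text under elem, tails included."""
    return "".join(elem.itertext())

print(flatten(p))
```

Nothing is lost on reading; the trailing text is simply attached to
the element it follows rather than stored as a separate node.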