[Python-Dev] Fixing the XML batteries

Fri Dec 9 09:02:35 CET 2011

Hi everyone,

I think Py3.3 would be a good milestone for cleaning up the stdlib support 
for XML. Note upfront: you may or may not know me as the maintainer of 
lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post 
was triggered by the following kind of conversation that I keep having with 
new XML users in Python (mostly on c.l.py), which hints at some serious 
flaw in the stdlib.

User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
Me: What library are you using? Could you show us some code?
User: My code looks like this snippet: ...
Me: You are using minidom, which is known to be hard to use, slow, and 
memory hungry. Use the xml.etree.ElementTree package instead, or rather 
its C implementation cElementTree, also in the stdlib.
User (coming back after a while): thanks, that was exactly what [I didn't 
know] I was looking for.

What does this tell us?

1) MiniDOM is what new users find first. It's highly visible because there 
are still lots of ancient "Python and XML" web pages out there that date 
back to the time before Python 2.5 (or rather something like 2.2), when 
it was the only XML tree library in the stdlib. It's also the first hit 
from the top when you search for "XML" on the stdlib docs page and contains 
the (to some people) familiar word "DOM", which lets users stop their 
search and start writing code, not expecting to find a separate alternative 
in the same stdlib, way further down. And the description as "mini", 
"simple" and "lightweight" suggests to users that it's going to be easy to 
use and efficient.

2) MiniDOM is not what users want. It leads to complicated, unpythonic code 
and lots of problems. It is neither easy to use, nor efficient, nor 
"lightweight", "simple" or "mini", at least not in absolute numbers (see 
http://bugs.python.org/issue11379#msg148584 and following for a recent 
discussion). It's also badly maintained in the sense that its performance 
characteristics could likely be improved, but no-one is seriously 
interested in doing that, because it would not lead to something that 
actually *is* fast or memory friendly compared to any of the 'real' 
alternatives that are available right now.

3) ElementTree is what users should use; MiniDOM is not. ElementTree was 
added to the stdlib in Py2.5 by popular demand, exactly because it is very 
easy to use, very fast, and very memory friendly. And because users did not 
want to use MiniDOM any more. Today, ElementTree has a rather 
straightforward upgrade path towards lxml.etree if more XML features like 
validation or 
XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.

4) In the stdlib, cElementTree is independent of ElementTree, but totally 
hidden in the documentation. In conversations like the above, it's 
unnecessarily complex to explain to users that there is ElementTree (which 
is documented in the stdlib), but that what they want to use is really 
cElementTree, which has the same API but does not have a stdlib 
documentation page that I can send them to. Note that the other Python 
implementations simply provide cElementTree as an alias for ElementTree. 
That leaves CPython as the only Python implementation that really has these 
two separate modules.
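
To illustrate points 2) and 3), here is a minimal sketch that extracts the 
same data with both libraries. The sample document and the function names 
are made up for this example; only the stdlib calls are real.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this comparison.
SAMPLE = (
    "<library>"
    "<book id='1'><title>Dive In</title></book>"
    "<book id='2'><title>Fluent</title></book>"
    "</library>"
)


def titles_minidom(xml_text):
    """Collect (id, title) pairs using the DOM API."""
    doc = minidom.parseString(xml_text)
    result = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Text is stored in child nodes and has to be joined by hand.
        text = "".join(
            node.data
            for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        result.append((book.getAttribute("id"), text))
    return result


def titles_etree(xml_text):
    """Collect (id, title) pairs using ElementTree."""
    root = ET.fromstring(xml_text)
    return [(book.get("id"), book.findtext("title"))
            for book in root.iter("book")]
```

Both functions return the same list of pairs. The ElementTree version gets 
there without touching node types or text nodes at all.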

So, there are many problems here. I think they make it unnecessarily 
complicated for users to process XML in Python, and that the current 
situation helps to turn new users away from Python as a language for XML 
processing. Python does have impressively great tools for working with XML. 
It's just that the stdlib and its documentation do not reflect or even 
appreciate that.

What should change?

a) The stdlib documentation should help users to choose the right tool 
right from the start. Instead of using the totally misleading wording that 
it uses now, it should be honest about the performance characteristics of 
MiniDOM and should actively suggest that those who don't know what to 
choose (or even *that* they can choose) should not use MiniDOM in the first 
place. I created a ticket (issue11379) for a minor step in this direction, 
but given the responses, I'm rather convinced that there's a lot more that 
can be done and should be done, and that it should be done now, right for 
the next release.

b) cElementTree should finally lose its "special" status as a separate 
library and disappear as an accelerator module behind ElementTree. This has 
been suggested a couple of times already, and AFAIR, there was some 
opposition because 1) ET was maintained outside of the stdlib and 2) the 
APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2 
was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent 
Xicluna (who is doing a good job with it), and ET 1.3 has basically made 
the APIs of both implementations compatible again. So, 3.3 would be the 
right milestone for fixing the "two libs for one" quirk.
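
For context on b), this is roughly the import dance that user code ends up 
carrying as long as the two modules stay separate. It is only a sketch of 
the common idiom; the alias ET and the sample document are arbitrary 
choices for this example.

```python
# Prefer the C implementation when it is available, and fall back to the
# pure Python module otherwise. Both expose the same API for this usage.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical sample data, just to show that the rest of the code does
# not care which of the two modules was imported.
root = ET.fromstring("<root><item>a</item><item>b</item></root>")
items = [element.text for element in root.findall("item")]
```

If cElementTree becomes an accelerator behind ElementTree, a plain import 
of xml.etree.ElementTree would be all that users need to write.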

Given that this is the third time during the last couple of years that I'm 
proposing to finally fix the stdlib and its documentation, I won't provide 
any further patches before it has finally been accepted that a) this is a 
problem and b) it should be fixed, thus allowing the patches to actually 
serve a purpose. If we can agree on that, I'll happily help in making this 
change happen.

Stefan