[Python-Dev] Fixing the XML batteries
Stefan Behnel
stefan_ml at behnel.de
Fri Dec 9 09:02:35 CET 2011
Hi everyone,
I think Py3.3 would be a good milestone for cleaning up the stdlib support
for XML. Note upfront: you may or may not know me as the maintainer of
lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post
was triggered by the following kind of conversation that I keep having with
new XML users in Python (mostly on c.l.py), which hints at a serious
flaw in the stdlib.
User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
Me: What library are you using? Could you show us some code?
User: My code looks like this snippet: ...
Me: You are using minidom, which is known to be hard to use, slow and
memory-hungry. Use the xml.etree.ElementTree package instead, or rather
its C implementation cElementTree, also in the stdlib.
User (coming back after a while): thanks, that was exactly what [I didn't
know] I was looking for.
What does this tell us?
1) MiniDOM is what new users find first. It's highly visible because there
are still lots of ancient "Python and XML" web pages out there that date
back to the time before Python 2.5 (or rather something like 2.2), when
it was the only XML tree library in the stdlib. It's also the first hit
from the top when you search for "XML" on the stdlib docs page and contains
the (to some people) familiar word "DOM", which lets users stop their
search and start writing code, not expecting to find a separate alternative
in the same stdlib, way further down. And the description as "mini",
"simple" and "lightweight" suggests to users that it's going to be easy to
use and efficient.
2) MiniDOM is not what users want. It leads to complicated, unpythonic code
and lots of problems. It is neither easy to use, nor efficient, nor
"lightweight", "simple" or "mini", at least not in absolute numbers (see
http://bugs.python.org/issue11379#msg148584 and following for a recent
discussion). It's also badly maintained in the sense that its performance
characteristics could likely be improved, but no-one is seriously
interested in doing that, because it would not lead to something that
actually *is* fast or memory friendly compared to any of the 'real'
alternatives that are available right now.
3) ElementTree is what users should use, MiniDOM is not. ElementTree was
added to the stdlib in Py2.5 by popular demand, exactly because it is very
easy to use, very fast, and very memory friendly. And because users did not
want to use MiniDOM any more. Today, ElementTree has a rather straightforward
upgrade path towards lxml.etree if more XML features like validation or
XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.
4) In the stdlib, cElementTree is independent of ElementTree, but totally
hidden in the documentation. In conversations like the above, it's
unnecessarily complex to explain to users that there is ElementTree (which
is documented in the stdlib), but that what they want to use is really
cElementTree, which has the same API but does not have a stdlib
documentation page that I can send them to. Note that the other Python
implementations simply provide cElementTree as an alias for ElementTree.
That leaves CPython as the only Python implementation that really has these
two separate modules.
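
To make points 2) and 3) concrete, here is a minimal sketch that extracts
the same data with both libraries. The sample document and the function
names titles_minidom and titles_etree are made up for this illustration;
only the xml.dom.minidom and xml.etree.ElementTree calls are stdlib API.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only.
XML = (
    "<library>"
    "<book><title>Dive Into Python</title></book>"
    "<book><title>Python in a Nutshell</title></book>"
    "</library>"
)


def titles_minidom(xml_text):
    """Collect all <title> texts using MiniDOM."""
    doc = minidom.parseString(xml_text)
    titles = []
    for node in doc.getElementsByTagName("title"):
        # Element text lives in separate child text nodes, so it has to
        # be gathered and joined by hand.
        text = "".join(
            child.data
            for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        titles.append(text)
    return titles


def titles_etree(xml_text):
    """Collect all <title> texts using ElementTree."""
    root = ET.fromstring(xml_text)
    # Leading text is available directly as the .text attribute.
    return [title.text for title in root.iter("title")]


print(titles_minidom(XML))
print(titles_etree(XML))
```

Both functions return the same list of titles. The difference is in how
much DOM machinery (node types, child node lists) the MiniDOM version has
to spell out for a task this small.
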
So, there are many problems here. I think they make it unnecessarily
complicated for users to process XML in Python, and that the current
situation helps turn new users away from Python as a language for XML
processing. Python does have impressively great tools for working with XML.
It's just that the stdlib and its documentation do not reflect or even
appreciate that.
What should change?
a) The stdlib documentation should help users to choose the right tool
right from the start. Instead of using the totally misleading wording that
it uses now, it should be honest about the performance characteristics of
MiniDOM and should actively suggest that those who don't know what to
choose (or even *that* they can choose) should not use MiniDOM in the first
place. I created a ticket (issue11379) for a minor step in this direction,
but given the responses, I'm rather convinced that there's a lot more that
can be done and should be done, and that it should be done now, right for
the next release.
b) cElementTree should finally lose its "special" status as a separate
library and disappear as an accelerator module behind ElementTree. This has
been suggested a couple of times already, and AFAIR, there was some
opposition because 1) ET was maintained outside of the stdlib and 2) the
APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2
was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent
Xicluna (who is doing a good job with it), and ET 1.3 has basically made
the APIs of both implementations compatible again. So, 3.3 would be the
right milestone for fixing the "two libs for one" quirk.
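
For reference, the workaround that user code commonly applies today looks
roughly like the sketch below: try the C accelerator first and fall back
to the pure Python module. This only illustrates the "two libs for one"
quirk and is not a proposed patch; the sample document is made up.

```python
# Prefer the C accelerator when it exists, otherwise use the pure Python
# module. Both expose the same API, so the rest of the code is unchanged.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only.
root = ET.fromstring("<config><option name='debug'>on</option></config>")
option = root.find("option")
print(option.get("name"), option.text)
```

If ElementTree picked up the accelerator on its own, the try/except
would be unnecessary and a plain import of xml.etree.ElementTree would
be all that users ever need to learn. The except branch also covers
implementations that do not ship xml.etree.cElementTree at all.
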
Given that this is the third time during the last couple of years that I
have suggested finally fixing the stdlib and its documentation, I won't provide
any further patches before it has finally been accepted that a) this is a
problem and b) it should be fixed, thus allowing the patches to actually
serve a purpose. If we can agree on that, I'll happily help in making this
change happen.
Stefan