[Python-Dev] Fixing the XML batteries

Calvin Spealman ironfroggy at gmail.com
Mon Feb 6 13:48:00 CET 2012


On Dec 9, 2011 3:04 AM, "Stefan Behnel" <stefan_ml at behnel.de> wrote:
>
> Hi everyone,
>
> I think Py3.3 would be a good milestone for cleaning up the stdlib
> support for XML. Note upfront: you may or may not know me as the
> maintainer of lxml, the de-facto non-stdlib standard Python XML tool.
> This (lengthy) post was triggered by the following kind of conversation
> that I keep having with new XML users in Python (mostly on c.l.py),
> which hints at some serious flaw in the stdlib.
>
> User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
> Me: What library are you using? Could you show us some code?
> User: My code looks like this snippet: ...
> Me: You are using minidom, which is known to be hard to use and slow,
> and which uses lots of memory. Use the xml.etree.ElementTree package
> instead, or rather its C implementation cElementTree, also in the
> stdlib.
> User (coming back after a while): thanks, that was exactly what [I
> didn't know] I was looking for.
>
> What does this tell us?
>
> 1) MiniDOM is what new users find first. It's highly visible because
> there are still lots of ancient "Python and XML" web pages out there
> that date back to the time before Python 2.5 (or rather something like
> 2.2), when it was the only XML tree library in the stdlib. It's also the
> first hit from the top when you search for "XML" on the stdlib docs page
> and contains the (to some people) familiar word "DOM", which lets users
> stop their search and start writing code, not expecting to find a
> separate alternative in the same stdlib, way further down. And the
> description as "mini", "simple" and "lightweight" suggests to users that
> it's going to be easy to use and efficient.
>
> 2) MiniDOM is not what users want. It leads to complicated, unpythonic
> code and lots of problems. It is neither easy to use, nor efficient, nor
> "lightweight", "simple" or "mini", not in absolute numbers (see
> http://bugs.python.org/issue11379#msg148584 and following for a recent
> discussion). It's also badly maintained in the sense that its
> performance characteristics could likely be improved, but no-one is
> seriously interested in doing that, because it would not lead to
> something that actually *is* fast or memory friendly compared to any of
> the 'real' alternatives that are available right now.
>
> 3) ElementTree is what users should use, MiniDOM is not. ElementTree was
> added to the stdlib in Py2.5 by popular demand, exactly because it is
> very easy to use, very fast, and very memory friendly. And because users
> did not want to use MiniDOM any more. Today, ElementTree has a rather
> straightforward upgrade path towards lxml.etree if more XML features
> like validation or XSLT are needed. MiniDOM has nothing like that to
> offer. It's a dead end.
>
> 4) In the stdlib, cElementTree is independent of ElementTree, but
> totally hidden in the documentation. In conversations like the above,
> it's unnecessarily complex to explain to users that there is ElementTree
> (which is documented in the stdlib), but that what they want to use is
> really cElementTree, which has the same API but does not have a stdlib
> documentation page that I can send them to. Note that the other Python
> implementations simply provide cElementTree as an alias for ElementTree.
> That leaves CPython as the only Python implementation that really has
> these two separate modules.
>
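
To illustrate the contrast drawn in points 2) and 3) above, here is a minimal sketch (not code from the original post) of the same task, reading each book's id and title, written once with minidom and once with ElementTree. The sample document and the helper names titles_minidom and titles_etree are assumptions made up for this example.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML = (
    "<library>"
    "<book id='1'><title>Dive In</title></book>"
    "<book id='2'><title>Fluent</title></book>"
    "</library>"
)


def titles_minidom(text):
    """Collect (id, title) pairs using the DOM-style API."""
    doc = minidom.parseString(text)
    result = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Text lives in child nodes, so it has to be gathered by hand.
        parts = [n.data for n in title.childNodes
                 if n.nodeType == n.TEXT_NODE]
        result.append((book.getAttribute("id"), "".join(parts)))
    return result


def titles_etree(text):
    """Collect (id, title) pairs using ElementTree."""
    root = ET.fromstring(text)
    return [(book.get("id"), book.findtext("title"))
            for book in root.iter("book")]


if __name__ == "__main__":
    print(titles_minidom(XML))
    print(titles_etree(XML))
```

Both helpers return the same pairs; the difference is how much node bookkeeping the caller has to do to get there.
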
> So, there are many problems here. I think they make it unnecessarily
> complicated for users to process XML in Python, and the current
> situation helps to turn new users away from Python as a language for XML
> processing. Python does have impressively great tools for working with
> XML. It's just that the stdlib and its documentation do not reflect or
> even appreciate that.
>
> What should change?
>
> a) The stdlib documentation should help users to choose the right tool
> right from the start. Instead of using the totally misleading wording
> that it uses now, it should be honest about the performance
> characteristics of MiniDOM and should actively suggest that those who
> don't know what to choose (or even *that* they can choose) should not
> use MiniDOM in the first place. I created a ticket (issue11379) for a
> minor step in this direction, but given the responses, I'm rather
> convinced that there's a lot more that can be done and should be done,
> and that it should be done now, in time for the next release.
>
> b) cElementTree should finally lose its "special" status as a separate
> library and disappear as an accelerator module behind ElementTree. This
> has been suggested a couple of times already, and AFAIR, there was some
> opposition because 1) ET was maintained outside of the stdlib and 2) the
> APIs of both were not identical. However, getting ET 1.3 into Py2.7 and
> 3.2 was a U-turn. Today, ET is *only* being maintained in the stdlib by
> Florent Xicluna (who is doing a good job with it), and ET 1.3 has
> basically made the APIs of both implementations compatible again. So,
> 3.3 would be the right milestone for fixing the "two libs for one"
> quirk.
>
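
For context on the "two libs for one" quirk in b), this is a minimal sketch (an illustration, not code from the original post) of the import idiom that user code needs while the accelerator is a separate module: try the C implementation first and fall back to the pure Python one. The sample document and the variable names are assumptions for the example.

```python
try:
    # Prefer the C implementation where it exists as a separate module.
    import xml.etree.cElementTree as ET
except ImportError:
    # Fall back to the documented module, which exposes the same API.
    import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
root = ET.fromstring("<root><item>1</item><item>2</item></root>")
values = [int(item.text) for item in root.findall("item")]

if __name__ == "__main__":
    print(values)
```

If ElementTree picked up the accelerator on its own, as proposed, the whole try/except block would shrink to a single import of xml.etree.ElementTree.
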
> Given that this is the third time during the last couple of years that
> I'm suggesting that we finally fix the stdlib and its documentation, I
> won't provide any further patches before it has finally been accepted
> that a) this is a problem and b) it should be fixed, thus allowing the
> patches to actually serve a purpose. If we can agree on that, I'll
> happily help in making this change happen.
>
> Stefan
>
>

This gets a strong +1 from me and, I suspect, from anyone else who spends
a significant amount of time in any of the Python support communities
(python-list, #python, etc). Defaults exist not only in our code, but also
in our documentation and presentation, and those defaults are wrong here.