Fixing the XML batteries
Hi everyone,

I think Py3.3 would be a good milestone for cleaning up the stdlib support for XML. Note upfront: you may or may not know me as the maintainer of lxml, the de-facto non-stdlib standard Python XML tool.

This (lengthy) post was triggered by the following kind of conversation that I keep having with new XML users in Python (mostly on c.l.py), which hints at some serious flaw in the stdlib.

User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
Me: What library are you using? Could you show us some code?
User: My code looks like this snippet: ...
Me: You are using minidom, which is known to be hard to use, slow and memory hungry. Use the xml.etree.ElementTree package instead, or rather its C implementation cElementTree, also in the stdlib.
User (coming back after a while): thanks, that was exactly what [I didn't know] I was looking for.

What does this tell us?

1) MiniDOM is what new users find first. It's highly visible because there are still lots of ancient "Python and XML" web pages out there that date back to the time before Python 2.5 (or rather something like 2.2), when it was the only XML tree library in the stdlib. It's also the first hit from the top when you search for "XML" on the stdlib docs page, and it contains the (to some people) familiar word "DOM", which lets users stop their search and start writing code, not expecting to find a separate alternative in the same stdlib, way further down. And the description as "mini", "simple" and "lightweight" suggests to users that it's going to be easy to use and efficient.

2) MiniDOM is not what users want. It leads to complicated, unpythonic code and lots of problems. It is neither easy to use, nor efficient, nor "lightweight", "simple" or "mini", not in absolute numbers (see http://bugs.python.org/issue11379#msg148584 and following for a recent discussion). It's also badly maintained in the sense that its performance characteristics could likely be improved, but no-one is seriously interested in doing that, because it would not lead to something that actually *is* fast or memory friendly compared to any of the 'real' alternatives that are available right now.

3) ElementTree is what users should use, MiniDOM is not. ElementTree was added to the stdlib in Py2.5 on popular demand, exactly because it is very easy to use, very fast, and very memory friendly. And because users did not want to use MiniDOM any more. Today, ElementTree has a rather straightforward upgrade path towards lxml.etree if more XML features like validation or XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.

4) In the stdlib, cElementTree is independent of ElementTree, but totally hidden in the documentation. In conversations like the above, it's unnecessarily complex to explain to users that there is ElementTree (which is documented in the stdlib), but that what they want to use is really cElementTree, which has the same API but does not have a stdlib documentation page that I can send them to. Note that the other Python implementations simply provide cElementTree as an alias for ElementTree. That leaves CPython as the only Python implementation that really has these two separate modules.

So, there are many problems here. I think they make it unnecessarily complicated for users to process XML in Python, and that the current situation helps in turning away new users from Python as a language for XML processing. Python does have impressively great tools for working with XML. It's just that the stdlib and its documentation do not reflect or even appreciate that.

What should change?

a) The stdlib documentation should help users to choose the right tool right from the start. Instead of using the totally misleading wording that it uses now, it should be honest about the performance characteristics of MiniDOM and should actively suggest that those who don't know what to choose (or even *that* they can choose) should not use MiniDOM in the first place. I created a ticket (issue11379) for a minor step in this direction, but given the responses, I'm rather convinced that there's a lot more that can be done and should be done, and that it should be done now, right for the next release.

b) cElementTree should finally lose its "special" status as a separate library and disappear as an accelerator module behind ElementTree. This has been suggested a couple of times already, and AFAIR, there was some opposition because 1) ET was maintained outside of the stdlib and 2) the APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2 was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent Xicluna (who is doing a good job with it), and ET 1.3 has basically made the APIs of both implementations compatible again. So, 3.3 would be the right milestone for fixing the "two libs for one" quirk.

Given that this is the third time during the last couple of years that I'm suggesting to finally fix the stdlib and its documentation, I won't provide any further patches before it has finally been accepted that a) this is a problem and b) it should be fixed, thus allowing the patches to actually serve a purpose. If we can agree on that, I'll happily help in making this change happen.

Stefan
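To make the "complicated code" point in 2) concrete, here is a minimal sketch that extracts the same data with both stdlib libraries. The sample document and the function names are made up for illustration; only the minidom and ElementTree calls themselves come from the stdlib.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample document, used only for this comparison.
XML = (
    "<library>"
    "<book id='1'><title>Dive In</title></book>"
    "<book id='2'><title>Learning XML</title></book>"
    "</library>"
)


def titles_minidom(text):
    """Collect all <title> texts using xml.dom.minidom."""
    doc = minidom.parseString(text)
    titles = []
    for node in doc.getElementsByTagName("title"):
        # Text lives in child text nodes, so it is collected by hand.
        titles.append("".join(
            child.data for child in node.childNodes
            if child.nodeType == child.TEXT_NODE))
    return titles


def titles_etree(text):
    """Collect all <title> texts using xml.etree.ElementTree."""
    root = ET.fromstring(text)
    return [title.text for title in root.iter("title")]
```

Both functions return the same list; the difference is in how much of the node model the caller has to deal with.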
I disagree. The right approach is not to document performance problems, but to fix them.
Unfortunately (?), there is a near-contract-like agreement with Fredrik Lundh that any significant changes to ElementTree in the standard library have to be agreed to by him. So whatever change you plan: make sure Fredrik gives his explicit support. Regards, Martin
"Martin v. Löwis", 14.12.2011 19:14:
I meant: "lack of interest in improving them". It's clear from the discussion that there are still users and that new code is still being written that uses MiniDOM. However, I would argue that this cannot possibly be performance critical code and that it only deals with somewhat small documents. I say that because MiniDOM is evidently not suitable for large documents or performance critical applications, so this is the only explanation I have why the performance problems would not be obvious in the cases where it is still being used. And if they do show, it appears to be much more likely that users rewrite their code using ElementTree or lxml than that they try to fix MiniDOM's performance issues. Now, read my first quote above again (and preferably also its context, which I already emphasized in a previous post), it should be clearer now. Stefan
On 2011-12-14, at 20:41 , Stefan Behnel wrote:
Could also be because "XML is slow (and sucks)" is part of the global consciousness at this point, so the fact that minidom is slow and verbose doesn't surprise anyone much.
Am 14.12.2011 20:41, schrieb Stefan Behnel:
That's also what I meant. I'm interested in improving them.
Now, read my first quote above again (and preferably also its context, which I already emphasized in a previous post), it should be clearer now.
I (now) know what you mean - but you are incorrect. Regards, Martin
"Martin v. Löwis", 14.12.2011 22:20:
Then please do. I posted the numbers, so you know what the baseline is, both in terms of speed and memory usage. If you need further benchmarks of other areas of the API (e.g. tag search or whatever), just ask. Note, however, that even an improvement by an order of magnitude wouldn't solve the API issue for new users, so I'd still suggest to add an appropriate link towards ET to the MiniDOM documentation. Stefan
Stefan Behnel, 14.12.2011 20:41:
Out of curiosity, I reran my benchmarks under PyPy 1.7. http://blog.behnel.de/index.php?p=210 In short: MiniDOM performs substantially better there, both in terms of time and space. That by itself doesn't make PyPy an interesting platform for XML processing (using lxml in CPython is way faster), but I found it interesting to note that the problem is not strictly inherent in MiniDOM. It also depends a lot on the runtime environment, even when it comes to memory usage. Stefan
On 2011-12-09, at 09:41 , Martin v. Löwis wrote:
Minidom is inferior in interface flow and pythonicity, in terseness, in speed, in memory consumption (even more so using cElementTree, and that's not something which can be fixed unless minidom gets a C accelerator), etc… Even after fixing minidom (if anybody has the time and drive to commit to it), ET/cET should be preferred over it. And that's not even considering the ease of switching to lxml (if only for validators), which Stefan outlined.

[0] not 100% true now that I think about it: handling mixed content is simpler in minidom, as there is no .text/.tail duality and text nodes are nodes like every other, but I really can't think of another reason to prefer minidom
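As an illustration of the .text/.tail duality mentioned in the footnote, the following sketch parses the same mixed-content fragment with both libraries. The sample markup and variable names are invented for this example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical mixed-content fragment: text, a child element, more text.
XML = "<p>Hello <b>big</b> world</p>"

# ElementTree: text is attached to elements, either before the first
# child (.text) or after an element's closing tag (.tail).
p = ET.fromstring(XML)
b = p[0]
et_view = (p.text, b.text, b.tail)
flat_text = "".join(p.itertext())

# minidom: text nodes are ordinary children, in document order.
dom_p = minidom.parseString(XML).documentElement
dom_children = [
    child.data if child.nodeType == child.TEXT_NODE else child.tagName
    for child in dom_p.childNodes
]
```

In ElementTree, the trailing " world" belongs to the `<b>` element as its tail, whereas minidom exposes it as a third child of `<p>`. For reading the text back in one piece, `itertext()` hides the distinction.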
Am 09.12.2011 10:09, schrieb Xavier Morel:
I don't mind pointing people to ElementTree, even though I disagree that the ET interface is "superior" to DOM. What I mind is Stefan's reasoning as to *why* people should be pointed to ET, and what words should be used to do that. IOW, I detest bashing some part of the standard library, just to urge users to use some other part of the standard library. People are still using PyXML, despite its not being maintained anymore. Telling them to replace 4DOM with minidom is much more appropriate than telling them to rewrite in ET. Regards, Martin
On 2011-12-11, at 23:03 , Martin v. Löwis wrote:
From my understanding, Stefan's suggestion is mostly aimed at "new" Python users trying to manipulate XML and not knowing what to use (yet). It's not about telling people to rewrite an existing codebase (that's a good idea as well when possible, as far as I'm concerned, but it's a different issue).
Martin, You seem heavily invested in minidom. In the near future I will need to parse and rewrite parts of an xml file created by a third-party program (PrintShopMail, for the curious). It contains both binary and textual data. Would you recommend minidom for this purpose? What other purposes would you recommend minidom for? xml-confused-ly yours, ~Ethan~ (Comments by others are, of course, also welcome. :)
"Martin v. Löwis", 11.12.2011 23:03:
Yes, that's clearly a point where we agree to disagree, and I understand that you are as biased towards minidom as I am biased towards ElementTree. However, I think I made it clear that the implementation of cElementTree (and lxml.etree as well, for that matter) is largely superior to MiniDOM in terms of performance, for any sensible meaning of the word performance. And I'm also convinced that the API is largely superior in terms of usability. ET certainly matches Python as a language much better than MiniDOM. But that's just my personal opinion.
I'm all for finding a good way of putting it into words, as long as it keeps uninformed users from making the wrong decision and getting the wrong idea of how complicated and slow Python is.
People are still using PyXML, despite its not being maintained anymore.
My experience with that is that it's only *new* users that are still running into PyXML by accident, because they didn't see that it's a dead project and they find it through ancient web pages that tell them that they need it because "it's the way to do XML in Python" and "if minidom is not enough, use PyXML". Maybe we should "misuse" the stdlib documentation to clear that up as well. "PyXML" is just too attractive a name for a dead project. Just look through the xml-sig page: basically all requests regarding PyXML during the last five years deal with problems in installing it, i.e. *before* even starting to use it. So you can't use this to claim that people really *are* still using it.
Telling them to replace 4DOM with minidom is much more appropriate
Do you actually have any evidence that anyone is still actively using 4DOM?
than telling them to rewrite in ET.
I usually encourage people to rewrite minidom code for ET. It makes the code simpler, more readable, more maintainable and much faster. Stefan
On Fri, Dec 9, 2011 at 6:41 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
When we offer a better way to do something that new users want to do, we generally redirect them to the more recent alternative. I believe the redirection from the getopt module to the argparse module strikes the right tone for that kind of thing: http://docs.python.org/library/getopt

For the various XML libraries, I'd suggest a message along the lines of "Note: The <whatever> module is a <yada, yada, DOM based, whatever>. If all you are trying to do is read and write XML files, consider using the xml.etree.ElementTree module instead".

I'd also be +1 on adjusting the order of the XML pages in the main index such that xml.etree.ElementTree appeared before xml.parsers.expat and all the others slid down one entry.

These are simple changes that don't harm current users of the modules in the least, while being up front and very helpful for beginners. Again, I think argparse vs getopt is a good comparison: argparse appears first in the main index, and there's a redirection from getopt to argparse that says "if you don't have a specific reason to be using getopt, you probably want argparse instead".

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Dec 9, 2011 at 09:02, Stefan Behnel <stefan_ml@behnel.de> wrote:
An at least somewhat informed +1 from me. The ElementTree API is a very good way to deal with XML from Python, and it deserves to be promoted over the included alternatives. Let's deprecate the NiCad batteries and try to guide users toward the Li-Ion ones. Cheers, Dirkjan
On Fri, 09 Dec 2011 09:02:35 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
+1 and +1. I've done a lot of xml work in Python, and unless you've got a particular reason for wanting to use the dom, ElementTree is the only sane way to go. I recently converted a middling-sized app from using the dom to using ElementTree, and wrote up some guidelines for the process for the client. I can try and shake it out of my client's lawyers if it would help with this or if others are interested. <mike
Mike Meyer <mwm@mired.org> wrote:
I use ElementTree for parsing valid XML, but minidom for producing it. I think another thing that might go into "refreshing the batteries" is a feature comparison of BeautifulSoup and HTML5lib against the stdlib competition, to see what needs to be added/revised. Having to switch to an outside package for parsing possibly invalid HTML is a pain. Bill
On 9 December 2011 18:15, Bill Janssen <janssen@parc.com> wrote:
For what little use I make of XML/HTML parsing, I use lxml, simply because it has a parser that covers the sort of HTML I have to deal with in real life. As I have lxml installed, I use it for any XML parsing tasks, just because I'm used to it. Paul
Xavier Morel <python-dev@masklinn.net> wrote:
Inertia, I guess. I tried that first, and it seems to work. I tend to use html5lib and/or BeautifulSoup instead of ElementTree, and that's mainly because I find the documentation for ElementTree is confusing and partial and inconsistent. Having various undated but obsolete tutorials and documentation still up on effbot.org doesn't help. Bill
On Sat, Dec 10, 2011 at 00:43, Matt Joiner <anacrolix@gmail.com> wrote:
I second this. The doco is very bad.
It would be constructive to open issues for specific problems in the documentation. I'm sure this won't be hard to fix. Documentation should not be the roadblock for using a library. Eli
On Fri, 2011-12-09 at 19:39 +0100, Xavier Morel wrote:
To throw my 2c in here: I personally normally use minidom for manipulating (x)html data (through html5lib), and for writing XML. I think it's primarily because DOM:

a) matches the way I think about XML documents.
b) provides the same API as I use in other languages. (FWIW, I do a lot of DOM manipulation in JavaScript.)
c) "feels" (to me) more similar to other formats I work with.

All three may be because I haven't spent enough time with ElementTree - again I've found the documentation lacking.

Tim
Bill Janssen, 09.12.2011 19:15:
Such a feature request should be worth a separate thread. Note, however, that html5lib is likely way too big to add it to the stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, which would be the target release series for better HTML support. So, whatever library or API you would want to use for HTML processing is currently only the second question as long as Py3 lacks a real-world HTML parser in the stdlib, as well as a robust character detection mechanism. I don't think that can be fixed all that easily. Stefan
Stefan Behnel <stefan_ml@behnel.de> wrote:
Sounds like it needs a PEP. I'm only advocating spending some thought on what needs to be done -- whether outside libraries need to be adopted into the stdlib would be a step after that. But understanding *why* those libraries exist and are widely used should be a prerequisite to "refreshing" the stdlib's support. Bill
On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
Note, however, that html5lib is likely way too big to add it to the stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, which would be the target release series for better HTML support. So, whatever library or API you would want to use for HTML processing is currently only the second question as long as Py3 lacks a real-world HTML parser in the stdlib, as well as a robust character detection mechanism. I don't think that can be fixed all that easily.
Here's the problem in a nutshell, I think:

Everybody wants an HTML parser in the stdlib, because it's inconvenient to pull in a dependency for such a "simple" task.

Everybody wants the stdlib to remain small, stable, and simple and not get "overcomplicated".

Parsing arbitrary HTML5 is a monstrously complex problem, for which there exist rapidly-evolving standards and libraries to deal with it. Parsing 'the web' (which is rapidly growing to include stuff like SVG, MathML etc) is even harder.

My personal opinion is that HTML5Lib gets this problem almost completely right, and so it should be absorbed by the stdlib. Trying to re-invent this from scratch, or even use something like BeautifulSoup which uses a bunch of heuristics and hacks rather than reference to the laboriously-crafted standard that says exactly how parsing malformed stuff has to go to be "like a browser", seems like it will just give the stdlib solution a reputation for working on the test input but not working in the real world.

(No disrespect to BeautifulSoup: it was a great attempt in the pre-HTML5 world which it was born into, and I've used it numerous times to implement useful things. But much more effort has been poured into this problem since then, and the problems are better understood now.)

-glyph
On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
A little data: the HTML5lib project lives at https://code.google.com/p/html5lib/

It has 4 owners and 22 other committers. The most recent release, html5lib 0.90 for Python, is nearly 2 years old. Since there is a separate Python3 repository, and there is no mention of Python3 compatibility elsewhere that I saw, including the pypi listing, I assume that release is for Python2 only.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now, though still lots of errors and failures in the testsuite."

-- Terry Jan Reedy
On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:
I believe that you are correct.
I don't see what bearing this has on the discussion. There are three possible ways I can imagine to interpret this information.

First, you could believe that porting a codebase from Python 2 to Python 3 is much easier than solving a difficult domain-specific problem. In that case, html5lib has done the hard part and someone interested in html-in-the-stdlib should do the rest.

Second, you could believe that porting a codebase from Python 2 to Python 3 is harder than solving a difficult domain-specific problem, in which case something is seriously wrong with Python 3 or its attendant migration tools and that needs to be fixed, so someone should fix that rather than worrying about parsing HTML right now. (I doubt that many subscribers to this list would share this opinion, though.)

Third, you could believe that parsing HTML is not a difficult domain-specific problem. But only a crazy person would believe that, so you're left with one of the previous options :).

-glyph
On 12/10/2011 9:25 PM, Glyph Lefkowitz wrote:
On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:
If there really are 4 'owners' rather than 4 people with admin access to the site, then there are 4 people to negotiate with.
There are issues pointing to a 1.0 release, but I could not find any current timetable. The project looks a bit stagnant. That does not bode well for a commitment to future active maintenance.
I think both points above show that 'absorbing HTML5Lib in the stdlib' will involve more sociological and technical problems than doing so with an active one-person module that already runs on 3.2. One is that the multiple-version Python 2.x codebase is the reference version and that will not be incorporated. A serious plan will have to address the real situation. --- Terry Jan Reedy
Stefan Behnel, 09.12.2011 09:02:
I still think it is, so let me sum up the current discussion here.
It looks like there's agreement on this part.
There was some disagreement on whether MiniDOM should publicly disclose its performance characteristics in the documentation, and whether its use should be discouraged, even just for new users. However, it seemed that there was enough consensus to settle on Nick Coghlan's proposal for a compromise: move ElementTree up to the top of the list, and add a visible note to the top of each of the XML modules like this:

"Note: The <whatever> module is a <yada, yada, DOM based, whatever>. If all you are trying to do is read and write XML files, consider using the xml.etree.ElementTree module instead"

That template could (with a bit of peeking into the getopt documentation) be expanded into the following.

"""
[[Note: The xml.dom.minidom module provides an implementation of the W3C-DOM whose API is similar to that in other programming languages. Users who are unfamiliar with the W3C-DOM interface or who would like to write less code for processing XML files should consider using the xml.etree.ElementTree module instead.]]
"""

I think this should go on the xml.dom.minidom page as well as the xml.dom package page. Hand-wavingly, users who are new to the DOM are more likely to hit the package page first, whereas those who know it already will likely find the MiniDOM page directly.

Note that I'd still encourage the removal of the misleading word "lightweight" until it makes sense to put it back in a meaningful way. I therefore propose the following minimalistic changes to the first paragraph on the minidom page:

"""
xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also [+XXX: provide a] significantly smaller [+XXX: API].
"""

@Martin: note how the original paragraph does not refer to "4DOM" or "PyXML". It only generically mentions "the DOM interface". It is certainly not true that MiniDOM is more "light-weight" and "significantly smaller" than (most) other DOM interface implementations outside of the Python world, for example. So the current wording actually makes no sense at all.

Additionally, the documentation on the xml.sax page would benefit from the following paragraph:

"""
[[Note: The xml.sax package provides an implementation of the SAX interface whose API is similar to that in other programming languages. Users who are unfamiliar with the SAX interface or who would like to write less code for efficient stream processing of XML files should consider using the iterparse() function in the xml.etree.ElementTree module instead.]]
"""

If these changes are considered acceptable, I'll copy the above over to the documentation bug I opened at http://bugs.python.org/issue11379

Can these doc changes go into both 2.7 and 3.3? Given that there is no important difference between the implementations, I don't see why the documentation should differ in Py2.
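To illustrate what "less code for stream processing" in the proposed xml.sax note could mean in practice, here is a rough sketch that performs the same extraction once with a SAX handler and once with iterparse(). The log format, the class name and the function names are hypothetical and exist only for this comparison (Python 3 syntax).

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical input: collect the text of all entries with level="error".
XML = (
    b"<log>"
    b"<entry level='info'>started</entry>"
    b"<entry level='error'>disk full</entry>"
    b"<entry level='info'>stopped</entry>"
    b"</log>"
)


class ErrorCollector(xml.sax.ContentHandler):
    """SAX handler that tracks parsing state by hand."""

    def __init__(self):
        super().__init__()
        self.errors = []
        self._in_error = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "entry" and attrs.get("level") == "error":
            self._in_error = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is buffered.
        if self._in_error:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "entry" and self._in_error:
            self.errors.append("".join(self._chunks))
            self._in_error = False


def errors_sax(data):
    handler = ErrorCollector()
    xml.sax.parseString(data, handler)
    return handler.errors


def errors_iterparse(data):
    errors = []
    # Each "end" event delivers a complete element, text included.
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "entry" and elem.get("level") == "error":
            errors.append(elem.text)
    return errors
```

The SAX version has to maintain its own state between callbacks, while the iterparse() version receives finished elements and can use the normal ElementTree API on them.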
b) cElementTree should finally lose its "special" status as a separate library and disappear as an accelerator module behind ElementTree.
There was no opposition and a general agreement on this in the thread, except for the warning that Fredrik Lundh should have a say in this. I wrote him an e-mail and haven't received a response so far. We can wait a little longer, I guess, there's still time before 3.3beta. Stefan
On Fri, Dec 16, 2011 at 4:53 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Your suggested tweaks look good to me and could go into all of 2.7, 3.2 and 3.3
Having ElementTree implicitly do "from _elementtree import *" is a 3.3 only change, though. (Note that xml.etree.cElementTree isn't the actual acceleration module - that honor already goes to "_elementtree". The only bit missing is the automatic import in xml.etree.ElementTree and the appropriate test updates to ensure the Python version still gets tested) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
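For context, this is roughly the user-side fallback idiom that an automatic import inside xml.etree.ElementTree would make unnecessary. It is only a sketch: the helper name `load_with_fallback` is made up for illustration and is not part of the stdlib.

```python
import importlib


def load_with_fallback(accelerated, pure):
    """Return the accelerated module if importable, else the pure-Python one.

    Hypothetical helper, shown only to illustrate the idiom.
    """
    try:
        return importlib.import_module(accelerated)
    except ImportError:
        return importlib.import_module(pure)


# Prefer the separately named C implementation where it exists,
# and fall back to the documented module otherwise.
ET = load_with_fallback("xml.etree.cElementTree", "xml.etree.ElementTree")
root = ET.fromstring("<a><b/></a>")
```

Once the import of the accelerator happens inside xml.etree.ElementTree itself, a plain `import xml.etree.ElementTree as ET` is enough and callers no longer need to know that two module names exist.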
Le 16/12/2011 07:53, Stefan Behnel a écrit :
A small caveat to note about iterparse(), which I otherwise like a lot: when processing very big data (I encountered this with a region-wide openstreetmap XML dump), you have to remove the processed nodes from the root element. Otherwise, its memory footprint increases with the size of the document.
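A minimal sketch of that pattern, assuming the interesting elements are direct children of the root element (as in an OpenStreetMap-style dump). The function name and the default tag are illustrative only.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag="node"):
    """Count <tag> children of the root without keeping them in memory.

    Returns the count and the number of children left on the root.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a reference.
    _event, root = next(context)
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop everything attached to the root so far. Note that
            # clear() also resets the root's attributes and text.
            root.clear()
    return count, len(root)


# Tiny stand-in for a large dump.
sample = io.BytesIO(
    b"<osm><node id='1'/><node id='2'/><way id='3'/><node id='4'/></osm>"
)
result = count_elements(sample)
```

Calling only `elem.clear()` on each processed element frees its contents but still leaves the empty element attached to the root, which is why the root itself is cleared here.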
On Fri, Dec 9, 2011 at 10:02, Stefan Behnel <stefan_ml@behnel.de> wrote:
<snip> AFAIU nothing really happened with this. The discussion started with a lot of +1s but then got derailed. The related Issue 11379 also got stuck nearly two months ago. It would be great if some sort of consensus could be reached here, since this is an important issue :-) Eli
data:image/s3,"s3://crabby-images/49c20/49c2071f88d9e728f9d2becf1dbfa7ffd16efd09" alt=""
On Dec 9, 2011 3:04 AM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:
Hi everyone,
I think Py3.3 would be a good milestone for cleaning up the stdlib
support for XML. Note upfront: you may or may not know me as the maintainer of lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post was triggered by the following kind of conversation that I keep having with new XML users in Python (mostly on c.l.py), which hints at some serious flaw in the stdlib.
lots of memory. Use the xml.etree.ElementTree package instead, or rather its C implementation cElementTree, also in the stdlib. there are still lots of ancient "Python and XML" web pages out there that date back from the time before Python 2.5 (or rather something like 2.2), when it was the only XML tree library in the stdlib. It's also the first hit from the top when you search for "XML" on the stdlib docs page and contains the (to some people) familiar word "DOM", which lets users stop their search and start writing code, not expecting to find a separate alternative in the same stdlib, way further down. And the description as "mini", "simple" and "lightweight" suggests to users that it's going to be easy to use and efficient.
2) MiniDOM is not what users want. It leads to complicated, unpythonic code and lots of problems. It is neither easy to use, nor efficient, nor "lightweight", "simple" or "mini", not in absolute numbers (see http://bugs.python.org/issue11379#msg148584 and following for a recent discussion). It's also badly maintained in the sense that its performance characteristics could likely be improved, but no-one is seriously interested in doing that, because it would not lead to something that actually *is* fast or memory friendly compared to any of the 'real' alternatives that are available right now.
3) ElementTree is what users should use, MiniDOM is not. ElementTree was added to the stdlib in Py2.5 on popular demand, exactly because it is very easy to use, very fast, and very memory friendly. And because users did not want to use MiniDOM any more. Today, ElementTree has a rather straight upgrade path towards lxml.etree if more XML features like validation or XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.
4) In the stdlib, cElementTree is independent of ElementTree, but totally hidden in the documentation. In conversations like the above, it's unnecessarily complex to explain to users that there is ElementTree (which is documented in the stdlib), but that what they want to use is really cElementTree, which has the same API but does not have a stdlib documentation page that I can send them to. Note that the other Python implementations simply provide cElementTree as an alias for ElementTree. That leaves CPython as the only Python implementation that really has these two separate modules.
So, there are many problems here. And I think they make it unnecessarily complicated for users to process XML in Python and that the current situation helps in turning away new users from Python as a language for XML processing. Python does have impressively great tools for working with XML. It's just that the stdlib and its documentation do not reflect or even appreciate that.
What should change?
a) The stdlib documentation should help users to choose the right tool right from the start. Instead of using the totally misleading wording that it uses now, it should be honest about the performance characteristics of MiniDOM and should actively suggest that those who don't know what to choose (or even *that* they can choose) should not use MiniDOM in the first place. I created a ticket (issue11379) for a minor step in this direction, but given the responses, I'm rather convinced that there's a lot more that can be done and should be done, and that it should be done now, right for the next release.
b) cElementTree should finally lose its "special" status as a separate library and disappear as an accelerator module behind ElementTree. This has been suggested a couple of times already, and AFAIR, there was some opposition because 1) ET was maintained outside of the stdlib and 2) the APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2 was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent Xicluna (who is doing a good job with it), and ET 1.3 has basically made the APIs of both implementations compatible again. So, 3.3 would be the right milestone for fixing the "two libs for one" quirk.
Given that this is the third time during the last couple of years that I'm suggesting to finally fix the stdlib and its documentation, I won't provide any further patches before it has finally been accepted that a) this is a problem and b) it should be fixed, thus allowing the patches to actually serve a purpose. If we can agree on that, I'll happily help in making this change happen.
Stefan
This gets a strong +1 from me and, I suspect, anyone else who spends a significant amount of time in any of the python support communities (python-list, #python, etc). Defaults exist not only in our code, but also in our documentation and presentation, and those defaults are wrong here.
On one hand I agree that ET should be emphasized since it's the better API with a much faster implementation. But I also understand Martin's point of view that minidom has its place, so IMHO some sort of compromise should be reached. Perhaps we can recommend using ET for those not specifically interested in the DOM interface, but for those who *are*, minidom is still a good stdlib option (?). Tying this doc clarification with an optimization in minidom is not something that makes sense. This is just delaying a much needed change forever.
This, at least in my view, is the more important point, which unfortunately got much less attention in the thread. I was a bit shocked to see that in 3.3 trunk we still have both the Python and C versions exposed and only formally document ElementTree (the Python version). The only reference to cElementTree is an un-emphasized note: "A C implementation of this API is available as xml.etree.cElementTree." Is there anything that *really* blocks providing cElementTree on "import ElementTree" and removing the explicit cElementTree for 3.3 (or at least leaving it with a deprecation warning)? Eli
On 2/6/2012 8:01 AM, Eli Bendersky wrote:
If you can, go ahead and write a patch saying something like that. It should not be hard to come up with something that is a definite improvement. Create a tracker issue for comment, but don't let it sit forever.
Right.
Since the current policy seems to be to hide C behind Python when there is both, I assume that finishing the transition here is something nobody has gotten around to yet. Open another issue if there is not one.
If cElementTree were renamed _ElementTree for import from ElementTree, then a new cElementTree.py could raise the warning and then import _ElementTree also. -- Terry Jan Reedy
A tracker issue already exists for this - http://bugs.python.org/issue11379 - I see no reason to open a new one. I will add my opinion there - feel free to do that too.
I will open a separate discussion on this. Eli
I disagree. The right approach is not to document performance problems, but to fix them.
Unfortunately (?), there is a near-contract-like agreement with Fredrik Lundh that any significant changes to ElementTree in the standard library have to be agreed by him. So whatever change you plan: make sure Fredrik gives his explicit support. Regards, Martin
"Martin v. Löwis", 14.12.2011 19:14:
I meant: "lack of interest in improving them". It's clear from the discussion that there are still users and that new code is still being written that uses MiniDOM. However, I would argue that this cannot possibly be performance critical code and that it only deals with somewhat small documents. I say that because MiniDOM is evidently not suitable for large documents or performance critical applications, so this is the only explanation I have why the performance problems would not be obvious in the cases where it is still being used. And if they do show, it appears to be much more likely that users rewrite their code using ElementTree or lxml than that they try to fix MiniDOM's performance issues. Now, read my first quote above again (and preferably also its context, which I already emphasized in a previous post), it should be clearer now. Stefan
On 2011-12-14, at 20:41 , Stefan Behnel wrote:
I meant: "lack of interest in improving them". It's clear from the discussion that there are still users and that new code is still being written that uses MiniDOM. However, I would argue that this cannot possibly be performance critical code and that it only deals with somewhat small documents. I say that because MiniDOM is evidently not suitable for large documents or performance critical applications, so this is the only explanation I have why the performance problems would not be obvious in the cases where it is still being used. And if they do show, it appears to be much more likely that users rewrite their code using ElementTree or lxml than that they try to fix MiniDOM's performance issues.
Could also be because "XML is slow (and sucks)" is part of the global consciousness at this point, and that minidom is slow and verbose doesn't surprise much.
Am 14.12.2011 20:41, schrieb Stefan Behnel:
That's also what I meant. I'm interested in improving them.
Now, read my first quote above again (and preferably also its context, which I already emphasized in a previous post), it should be clearer now.
I (now) know what you mean - but you are incorrect. Regards, Martin
"Martin v. Löwis", 14.12.2011 22:20:
Then please do. I posted the numbers, so you know what the baseline is, both in terms of speed and memory usage. If you need further benchmarks of other areas of the API (e.g. tag search or whatever), just ask. Note, however, that even an improvement by an order of magnitude wouldn't solve the API issue for new users, so I'd still suggest to add an appropriate link towards ET to the MiniDOM documentation. Stefan
Stefan Behnel, 14.12.2011 20:41:
Out of curiosity, I reran my benchmarks under PyPy 1.7. http://blog.behnel.de/index.php?p=210 In short: MiniDOM performs substantially better there, both in terms of time and space. That by itself doesn't make PyPy an interesting platform for XML processing (using lxml in CPython is way faster), but I found it interesting to note that the problem is not strictly inherent in MiniDOM. It also depends a lot on the runtime environment, even when it comes to memory usage. Stefan
On 2011-12-09, at 09:41 , Martin v. Löwis wrote:
Minidom is inferior in interface flow and pythonicity, in terseness, in speed, in memory consumption (even more so using cElementTree, and that's not something which can be fixed unless minidom gets a C accelerator), etc… Even after fixing minidom (if anybody has the time and drive to commit to it), ET/cET should be preferred over it. And that's not even considering the ease of switching to lxml (if only for validators), which Stefan outlined.
[0] not 100% true now that I think about it: handling mixed content is simpler in minidom as there is no .text/.tail duality and text nodes are nodes like every other, but I really can't think of another reason to prefer minidom
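For readers who have not met the .text/.tail duality mentioned in the footnote, here is a small sketch (not from the thread; the sample snippet is made up) that parses the same mixed content with both libraries. ElementTree attaches the text after `<b>` to the `b` element as its tail, while minidom keeps it as a separate text node under `p`.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

SNIPPET = "<p>Hello <b>big</b> world</p>"

# ElementTree: text before the first child is p.text, text after </b> is b.tail.
p = ET.fromstring(SNIPPET)
b = p.find("b")
et_view = (p.text, b.text, b.tail)

# minidom: every piece of text is its own child node, in document order.
dom_p = minidom.parseString(SNIPPET).documentElement
dom_view = [(node.nodeName, node.nodeValue) for node in dom_p.childNodes]
```

Here `et_view` holds three strings spread over two elements, whereas `dom_view` lists three sibling nodes of `p`: a text node, the `b` element, and another text node.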
Am 09.12.2011 10:09, schrieb Xavier Morel:
I don't mind pointing people to ElementTree, despite that I disagree whether the ET interface is "superior" to DOM. It's Stefan's reasoning as to *why* people should be pointed to ET, and what words should be used to do that. IOW, I detest bashing some part of the standard library, just to urge users to use some other part of the standard library. People are still using PyXML, despite it's not being maintained anymore. Telling them to replace 4DOM with minidom is much more appropriate than telling them to rewrite in ET. Regards, Martin
On 2011-12-11, at 23:03 , Martin v. Löwis wrote:
From my understanding, Stefan's suggestion is mostly aimed at "new" python users trying to manipulate XML and not knowing what to use (yet). It's not about telling people to rewrite existing codebase (it's a good idea as well when possible, as far as I'm concerned, but it's a different issue).
Martin, You seem heavily invested in minidom. In the near future I will need to parse and rewrite parts of an xml file created by a third-party program (PrintShopMail, for the curious). It contains both binary and textual data. Would you recommend minidom for this purpose? What other purposes would you recommend minidom for? xml-confused-ly yours, ~Ethan~ (Comments by others are, of course, also welcome. :)
"Martin v. Löwis", 11.12.2011 23:03:
Yes, that's clearly a point where we agree to disagree, and I understand that you are as biased towards minidom as I am biased towards ElementTree. However, I think I made it clear that the implementation of cElementTree (and lxml.etree as well, for that matter) is largely superior to MiniDOM in terms of performance, for any sensible meaning of the word performance. And I'm also convinced that the API is largely superior in terms of usability. ET certainly matches Python as a language much better than MiniDOM. But that's just my personal opinion.
I'm all for finding a good way of putting it into words, as long as it keeps uninformed users from taking the wrong decision and getting the wrong idea of how complicated and slow Python is.
People are still using PyXML, despite it's not being maintained anymore.
My experience with that is that it's only *new* users that are still running into PyXML by accident, because they didn't see that it's a dead project and they find it through ancient web pages that tell them that they need it because "it's the way to do XML in Python" and "if minidom is not enough, use PyXML". Maybe we should "misuse" the stdlib documentation to clear that up as well. "PyXML" is just too attractive a name for a dead project. Just look through the xml-sig page, basically all requests regarding PyXML during the last five years deal with problems in installing it, i.e. *before* even starting to use it. So you can't use this to claim that people really *are* still using it.
Telling them to replace 4DOM with minidom is much more appropriate
Do you actually have any evidence that anyone is still actively using 4DOM?
than telling them to rewrite in ET.
I usually encourage people to rewrite minidom code for ET. It makes the code simpler, more readable, more maintainable and much faster. Stefan
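To make the kind of rewrite being discussed concrete, here is a minimal sketch (not taken from the thread) that extracts the same data once with minidom and once with ElementTree. The sample document and the function names `titles_minidom` and `titles_etree` are invented for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

DOC = (
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)


def titles_minidom(xml_text):
    """Collect book titles by walking DOM nodes and joining text nodes."""
    dom = minidom.parseString(xml_text)
    titles = []
    for book in dom.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        text = "".join(
            node.data for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        titles.append(text)
    return titles


def titles_etree(xml_text):
    """Collect book titles using ElementTree's findtext()."""
    root = ET.fromstring(xml_text)
    return [book.findtext("title") for book in root.iter("book")]
```

Both functions return the same list for this input; how much shorter the ElementTree version ends up will of course depend on the real code being converted.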
On Fri, Dec 9, 2011 at 6:41 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
When we offer a better way to do something that new users want to do, we generally redirect them to the more recent alternative. I believe the redirection from the getopt module to the argparse module strikes the right tone for that kind of thing: http://docs.python.org/library/getopt For the various XML libraries, a message along the lines of "Note: The <whatever> module is a <yada, yada, DOM based, whatever>. If all you are trying to do is read and write XML files, consider using the xml.etree.ElementTree module instead". I'd also be +1 on adjusting the order of the XML pages in the main index such that xml.etree.ElementTree appeared before xml.parser.expat and all the others slid down one entry. These are simple changes that don't harm current users of the modules in the least, while being up front and very helpful for beginners. Again, I think argparse vs getopt is a good comparison: argparse appears first in the main index, and there's a redirection from getopt to argparse that says "if you don't have a specific reason to be using getopt, you probably want argparse instead". -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Dec 9, 2011 at 09:02, Stefan Behnel <stefan_ml@behnel.de> wrote:
An at least somewhat informed +1 from me. The ElementTree API is a very good way to deal with XML from Python, and it deserves to be promoted over the included alternatives. Let's deprecate the NiCad batteries and try to guide users toward the Li-Ion ones. Cheers, Dirkjan
On Fri, 09 Dec 2011 09:02:35 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
+1 and +1. I've done a lot of xml work in Python, and unless you've got a particular reason for wanting to use the dom, ElementTree is the only sane way to go. I recently converted a middling-sized app from using the dom to using ElementTree, and wrote up some guidelines for the process for the client. I can try and shake it out of my client's lawyers if it would help with this or others are interested. <mike
Mike Meyer <mwm@mired.org> wrote:
I use ElementTree for parsing valid XML, but minidom for producing it. I think another thing that might go into "refreshing the batteries" is a feature comparison of BeautifulSoup and HTML5lib against the stdlib competition, to see what needs to be added/revised. Having to switch to an outside package for parsing possibly invalid HTML is a pain. Bill
On 9 December 2011 18:15, Bill Janssen <janssen@parc.com> wrote:
For what little use I make of XML/HTML parsing, I use lxml, simply because it has a parser that covers the sort of HTML I have to deal with in real life. As I have lxml installed, I use it for any XML parsing tasks, just because I'm used to it. Paul
Xavier Morel <python-dev@masklinn.net> wrote:
Inertia, I guess. I tried that first, and it seems to work. I tend to use html5lib and/or BeautifulSoup instead of ElementTree, and that's mainly because I find the documentation for ElementTree is confusing and partial and inconsistent. Having various undated but obsolete tutorials and documentation still up on effbot.org doesn't help. Bill
On Sat, Dec 10, 2011 at 00:43, Matt Joiner <anacrolix@gmail.com> wrote:
I second this. The doco is very bad.
It would be constructive to open issues for specific problems in the documentation. I'm sure this won't be hard to fix. Documentation should not be the roadblock for using a library. Eli
On Fri, 2011-12-09 at 19:39 +0100, Xavier Morel wrote:
To throw my 2c in here: I personally normally use minidom for manipulating (x)html data (through html5lib), and for writing XML. I think it's primarily because DOM: a) matches the way I think about XML documents. b) Provides the same API as I use in other languages. (FWIW, I do a lot of DOM manipulation in javascript) c) "Feels" (to me) more similar to other formats I work with. All three may be because I haven't spent enough time with ElementTree - again I've found the documentation lacking. Tim
Bill Janssen, 09.12.2011 19:15:
Such a feature request should be worth a separate thread. Note, however, that html5lib is likely way too big to add it to the stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, which would be the target release series for better HTML support. So, whatever library or API you would want to use for HTML processing is currently only the second question as long as Py3 lacks a real-world HTML parser in the stdlib, as well as a robust character detection mechanism. I don't think that can be fixed all that easily. Stefan
Stefan Behnel <stefan_ml@behnel.de> wrote:
Sounds like it needs a PEP. I'm only advocating spending some thought on what needs to be done -- whether outside libraries need to be adopted into the stdlib would be a step after that. But understanding *why* those libraries exist and are widely used should be a prerequisite to "refreshing" the stdlib's support. Bill
On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
Note, however, that html5lib is likely way too big to add it to the stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, which would be the target release series for better HTML support. So, whatever library or API you would want to use for HTML processing is currently only the second question as long as Py3 lacks a real-world HTML parser in the stdlib, as well as a robust character detection mechanism. I don't think that can be fixed all that easily.
Here's the problem in a nutshell, I think: Everybody wants an HTML parser in the stdlib, because it's inconvenient to pull in a dependency for such a "simple" task. Everybody wants the stdlib to remain small, stable, and simple and not get "overcomplicated". Parsing arbitrary HTML5 is a monstrously complex problem, for which there exist rapidly-evolving standards and libraries to deal with it. Parsing 'the web' (which is rapidly growing to include stuff like SVG, MathML etc) is even harder. My personal opinion is that HTML5Lib gets this problem almost completely right, and so it should be absorbed by the stdlib. Trying to re-invent this from scratch, or even use something like BeautifulSoup which uses a bunch of heuristics and hacks rather than reference to the laboriously-crafted standard that says exactly how parsing malformed stuff has to go to be "like a browser", seems like it will just give the stdlib solution a reputation for working on the test input but not working in the real world. (No disrespect to BeautifulSoup: it was a great attempt in the pre-HTML5 world which it was born into, and I've used it numerous times to implement useful things. But much more effort has been poured into this problem since then, and the problems are better understood now.) -glyph
On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
A little data: the HTML5lib project lives at https://code.google.com/p/html5lib/ It has 4 owners and 22 other committers. The most recent release, html5lib 0.90 for Python, is nearly 2 years old. Since there is a separate Python3 repository, and there is no mention of Python3 compatibility elsewhere that I saw, including the pypi listing, I assume that is for Python2 only. A comment on a recent (July 11) Python3 issue https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port suggests that the Python3 version still has problems. "Merged in now, though still lots of errors and failures in the testsuite." -- Terry Jan Reedy
On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:
I believe that you are correct.
I don't see what bearing this has on the discussion. There are three possible ways I can imagine to interpret this information. First, you could believe that porting a codebase from Python 2 to Python 3 is much easier than solving a difficult domain-specific problem. In that case, html5lib has done the hard part and someone interested in html-in-the-stdlib should do the rest. Second, you could believe that porting a codebase from Python 2 to Python 3 is harder than solving a difficult domain-specific problem, in which case something is seriously wrong with Python 3 or its attendant migration tools and that needs to be fixed, so someone should fix that rather than worrying about parsing HTML right now. (I doubt that many subscribers to this list would share this opinion, though.) Third, you could believe that parsing HTML is not a difficult domain-specific problem. But only a crazy person would believe that, so you're left with one of the previous options :). -glyph
On 12/10/2011 9:25 PM, Glyph Lefkowitz wrote:
On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:
If there really are 4 'owners' rather than 4 people with admin access to the site, then there are 4 people to negotiate with.
There are issues pointing to a 1.0 release, but I could not find any current timetable. The project looks a bit stagnant. That does not bode well for a commitment to future active maintenance.
I think both points above show that 'absorbing HTML5Lib in the stdlib' will involve more sociological and technical problems than doing so with an active one-person module that already runs on 3.2. One is that the multiple version Python 2.x codebase is the reference version and that will not be incorporated. A serious plan will have to address the real situation. --- Terry Jan Reedy
Stefan Behnel, 09.12.2011 09:02:
I still think it is, so let me sum up the current discussion here.
It looks like there's agreement on this part.
There was some disagreement on whether MiniDOM should publicly disclose its performance characteristics in the documentation, and whether its use should be discouraged, even just for new users. However, it seemed that there was enough consensus to settle on Nick Coghlan's proposal for a compromise to move ElementTree up to the top of the list, and to add a visible note to the top of each of the XML modules like this:
"Note: The <whatever> module is a <yada, yada, DOM based, whatever>. If all you are trying to do is read and write XML files, consider using the xml.etree.ElementTree module instead"
That template could (with a bit of peeking into the getopt documentation) be expanded into the following.
"""
[[Note: The xml.dom.minidom module provides an implementation of the W3C-DOM whose API is similar to that in other programming languages. Users who are unfamiliar with the W3C-DOM interface or who would like to write less code for processing XML files should consider using the xml.etree.ElementTree module instead.]]
"""
I think this should go on the xml.dom.minidom page as well as the xml.dom package page. Hand-wavingly, users who are new to the DOM are more likely to hit the package page first, whereas those who know it already will likely find the MiniDOM page directly.
Note that I'd still encourage the removal of the misleading word "lightweight" until it makes sense to put it back in a meaningful way. I therefore propose the following minimalistic changes to the first paragraph on the minidom page:
"""
xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also [+XXX: provide a] significantly smaller [+XXX: API].
"""
@Martin: note how the original paragraph does not refer to "4DOM" or "PyXML". It only generically mentions "the DOM interface". It is certainly not true that MiniDOM is more "light-weight" and "significantly smaller" than (most) other DOM interface implementations outside of the Python world, for example. So the current wording actually makes no sense at all.
Additionally, the documentation on the xml.sax page would benefit from the following paragraph:
"""
[[Note: The xml.sax package provides an implementation of the SAX interface whose API is similar to that in other programming languages. Users who are unfamiliar with the SAX interface or who would like to write less code for efficient stream processing of XML files should consider using the iterparse() function in the xml.etree.ElementTree module instead.]]
"""
If these changes are considered acceptable, I'll copy the above over to the documentation bug I opened at http://bugs.python.org/issue11379
Can these doc changes go into both 2.7 and 3.3? Given that there is no important difference between the implementations, I don't see why the documentation should differ in Py2.
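As an illustration of the proposed xml.sax note, the sketch below (not part of the original message) performs the same streaming task with a SAX handler and with iterparse(). The sample data and the names `ErrorCounter`, `count_errors_sax` and `count_errors_iterparse` are assumptions made up for this example.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

DATA = (
    b"<log>"
    b"<entry level='info'/><entry level='error'/><entry level='error'/>"
    b"</log>"
)


class ErrorCounter(xml.sax.ContentHandler):
    """SAX style: state lives in a handler object driven by callbacks."""

    def __init__(self):
        super().__init__()
        self.errors = 0

    def startElement(self, name, attrs):
        if name == "entry" and attrs.get("level") == "error":
            self.errors += 1


def count_errors_sax(data):
    handler = ErrorCounter()
    xml.sax.parseString(data, handler)
    return handler.errors


def count_errors_iterparse(data):
    """iterparse style: a plain loop over (event, element) pairs."""
    return sum(
        1
        for _, elem in ET.iterparse(io.BytesIO(data))
        if elem.tag == "entry" and elem.get("level") == "error"
    )
```

Both functions give the same count for this input. For very large inputs the iterparse() variant would additionally need to discard processed elements to keep memory use flat.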
b) cElementTree should finally lose its "special" status as a separate library and disappear as an accelerator module behind ElementTree.
There was no opposition and a general agreement on this in the thread, except for the warning that Fredrik Lundh should have a word in this. I wrote him an e-mail and didn't get a response so far. We can wait a little longer, I guess, there's still time before 3.3beta. Stefan
On Fri, Dec 16, 2011 at 4:53 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Your suggested tweaks look good to me and could go into all of 2.7, 3.2 and 3.3
Having ElementTree implicitly do "from _elementtree import *" is a 3.3 only change, though. (Note that xml.etree.cElementTree isn't the actual acceleration module - that honor already goes to "_elementtree". The only bit missing is the automatic import in xml.etree.ElementTree and the appropriate test updates to ensure the Python version still gets tested) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
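The "automatic import" described here is the usual accelerator-with-fallback pattern. Inside a module it is typically written as a try/except around the star import; the sketch below expresses the same idea as a small helper so it can be run on its own. The helper name `load_accelerated` and the bogus module name used for the fallback case are hypothetical.

```python
import importlib
import xml.etree.ElementTree as pure_python_impl


def load_accelerated(accelerator_name, fallback_module):
    """Return the accelerator module if it can be imported, else the fallback."""
    try:
        return importlib.import_module(accelerator_name)
    except ImportError:
        # No C accelerator available (e.g. on another Python implementation):
        # silently keep using the pure Python code.
        return fallback_module


# A name that does not exist, to show the fallback path.
impl = load_accelerated("_no_such_accelerator_", pure_python_impl)
```

Callers only ever import the one public module and get the fast version whenever it is available, which is the behaviour the thread is asking for.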
Le 16/12/2011 07:53, Stefan Behnel a écrit :
A small caveat to note about iterparse(), which I otherwise like a lot: when processing very big data (I encountered this with a region-wide openstreetmap XML dump), you have to remove the processed nodes from the root element. Otherwise, its memory footprint increases with the size of the document.
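The caveat above can be sketched as follows. This is a minimal example under stated assumptions: the function name `count_elements` and the tiny OSM-like sample are made up, and real code would do its per-element work where the counter is incremented.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count `tag` elements in a stream while keeping the tree small."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop everything collected under the root so far; without this,
            # the root keeps a reference to every element already processed.
            root.clear()
    return count


sample = io.BytesIO(b"<osm><node id='1'/><node id='2'/><way id='3'/></osm>")
node_count = count_elements(sample, "node")
```

Note that `root.clear()` also resets the root element's own attributes and text, so anything needed from the root should be read before the loop starts.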
On Fri, Dec 9, 2011 at 10:02, Stefan Behnel <stefan_ml@behnel.de> wrote:
<snip> AFAIU nothing really happened with this. The discussion started with a lot of +1s but then got derailed. The related Issue 11379 also got stuck nearly two months ago. It would be great if some sort of consensus could be reached here, since this is an important issue :-) Eli
data:image/s3,"s3://crabby-images/49c20/49c2071f88d9e728f9d2becf1dbfa7ffd16efd09" alt=""
On Dec 9, 2011 3:04 AM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:
Hi everyone,
I think Py3.3 would be a good milestone for cleaning up the stdlib
support for XML. Note upfront: you may or may not know me as the maintainer of lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post was triggered by the following kind of conversation that I keep having with new XML users in Python (mostly on c.l.py), which hints at some serious flaw in the stdlib.
lots of memory. Use the xml.etree.ElementTree package instead, or rather its C implementation cElementTree, also in the stdlib. there are still lots of ancient "Python and XML" web pages out there that date back from the time before Python 2.5 (or rather something like 2.2), when it was the only XML tree library in the stdlib. It's also the first hit from the top when you search for "XML" on the stdlib docs page and contains the (to some people) familiar word "DOM", which lets users stop their search and start writing code, not expecting to find a separate alternative in the same stdlib, way further down. And the description as "mini", "simple" and "lightweight" suggests to users that it's going to be easy to use and efficient.
2) MiniDOM is not what users want. It leads to complicated, unpythonic
code and lots of problems. It is neither easy to use, nor efficient, nor "lightweight", "simple" or "mini", not in absolute numbers (see http://bugs.python.org/issue11379#msg148584 and following for a recent discussion). It's also badly maintained in the sense that its performance characteristics could likely be improved, but no-one is seriously interested in doing that, because it would not lead to something that actually *is* fast or memory friendly compared to any of the 'real' alternatives that are available right now.
3) ElementTree is what users should use, MiniDOM is not. ElementTree was
added to the stdlib in Py2.5 on popular demand, exactly because it is very easy to use, very fast, and very memory friendly. And because users did not want to use MiniDOM any more. Today, ElementTree has a rather straight upgrade path towards lxml.etree if more XML features like validation or XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.
4) In the stdlib, cElementTree is independent of ElementTree, but totally
hidden in the documentation. In conversations like the above, it's unnecessarily complex to explain to users that there is ElementTree (which is documented in the stdlib), but that what they want to use is really cElementTree, which has the same API but does not have a stdlib documentation page that I can send them to. Note that the other Python implementations simply provide cElementTree as an alias for ElementTree. That leaves CPython as the only Python implementation that really has these two separate modules.
So, there are many problems here. And I think they make it unnecessarily
complicated for users to process XML in Python and that the current situation helps in turning away new users from Python as a language for XML processing. Python does have impressively great tools for working with XML. It's just that the stdlib and its documentation do not reflect or even appreciate that.
What should change?
a) The stdlib documentation should help users to choose the right tool
right from the start. Instead of using the totally misleading wording that it uses now, it should be honest about the performance characteristics of MiniDOM and should actively suggest that those who don't know what to choose (or even *that* they can choose) should not use MiniDOM in the first place. I created a ticket (issue11379) for a minor step in this direction, but given the responses, I'm rather convinced that there's a lot more that can be done and should be done, and that it should be done now, right for the next release.
b) cElementTree should finally lose its "special" status as a separate
library and disappear as an accelerator module behind ElementTree. This has been suggested a couple of times already, and AFAIR, there was some opposition because 1) ET was maintained outside of the stdlib and 2) the APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2 was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent Xicluna (who is doing a good job with it), and ET 1.3 has basically made the APIs of both implementations compatible again. So, 3.3 would be the right milestone for fixing the "two libs for one" quirk.
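For readers unfamiliar with the "accelerator module" arrangement in b): the pure-Python module defines everything first and then tries to replace those definitions with ones from a C extension, so users only ever import one name. The sketch below shows the general pattern only. The function escape_text and the module name _hypothetical_accel are invented for illustration and are not part of ElementTree or the stdlib.

```python
def escape_text(text):
    """Pure-Python fallback: escape the XML special characters in text."""
    # '&' must be replaced first so the entities added below stay intact.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


# If a compiled implementation is importable, it silently replaces the
# definition above. The module name is hypothetical; when it is missing,
# the pure-Python version stays in place.
try:
    from _hypothetical_accel import escape_text
except ImportError:
    pass
```

Callers use escape_text() the same way in both cases and never need to know whether the accelerator was found.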
Given that this is the third time during the last couple of years that
I'm suggesting to finally fix the stdlib and its documentation, I won't provide any further patches before it has finally been accepted that a) this is a problem and b) it should be fixed, thus allowing the patches to actually serve a purpose. If we can agree on that, I'll happily help in making this change happen.
Stefan
This gets a strong +1 from me and, I suspect, from anyone else who spends a significant amount of time in any of the Python support communities (python-list, #python, etc). Defaults exist not only in our code, but also in our documentation and presentation, and those defaults are wrong here.
On one hand I agree that ET should be emphasized since it's the better API with a much faster implementation. But I also understand Martin's point of view that minidom has its place, so IMHO some sort of compromise should be reached. Perhaps we can recommend using ET for those not specifically interested in the DOM interface, but for those who *are*, minidom is still a good stdlib option (?). Tying this doc clarification to an optimization in minidom does not make sense; it just delays a much needed change forever.
This, at least in my view, is the more important point, which unfortunately got much less attention in the thread. I was a bit shocked to see that in 3.3 trunk we still have both the Python and C versions exposed and only formally document ElementTree (the Python version). The only reference to cElementTree is an un-emphasized note: "A C implementation of this API is available as xml.etree.cElementTree." Is there anything that *really* blocks providing cElementTree on "import ElementTree" and removing the explicit cElementTree for 3.3 (or at least leaving it with a deprecation warning)?
Eli
On 2/6/2012 8:01 AM, Eli Bendersky wrote:
If you can, go ahead and write a patch saying something like that. It should not be hard to come up with something that is a definite improvement. Create a tracker issue for comment, but don't let it sit forever.
Right.
Since the current policy seems to be to hide C behind Python when there is both, I assume that finishing the transition here is something that has just not been gotten around to yet. Open another issue if there is not one.
If cElementTree were renamed _ElementTree for import from ElementTree, then a new cElementTree.py could raise the warning and then import _ElementTree also.
-- Terry Jan Reedy
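A rough sketch of the deprecation shim idea above, not an actual patch. In a real cElementTree.py the warning and the re-export would sit at module level; here they are wrapped in a function (make_deprecated_alias, a name invented for this example) so the snippet can run on its own. It re-exports from xml.etree.ElementTree, since the _ElementTree module named in the suggestion is an assumption about a future rename.

```python
import types
import warnings
import xml.etree.ElementTree as ElementTree


def make_deprecated_alias(alias_name, target):
    """Return a module object mirroring target's public names, warning once."""
    warnings.warn(
        "%s is deprecated; use %s instead" % (alias_name, target.__name__),
        DeprecationWarning,
        stacklevel=2,
    )
    shim = types.ModuleType(alias_name)
    for name, value in vars(target).items():
        # Copy only the public API; private helpers stay behind.
        if not name.startswith("_"):
            setattr(shim, name, value)
    return shim


# Record warnings explicitly, since DeprecationWarning is often hidden
# by the default filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cET = make_deprecated_alias("cElementTree", ElementTree)
```

Existing code that uses the old name keeps working through the alias, while the warning points users at the one documented module.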
A tracker issue already exists for this - http://bugs.python.org/issue11379 - I see no reason to open a new one. I will add my opinion there - feel free to do that too.
I will open a separate discussion on this.
Eli
participants (18)
- "Martin v. Löwis"
- Antoine Pitrou
- Baptiste Carvello
- Bill Janssen
- Calvin Spealman
- Dirkjan Ochtman
- Eli Bendersky
- Ethan Furman
- Glyph Lefkowitz
- Matt Joiner
- Mike Meyer
- Nick Coghlan
- Paul Moore
- Stefan Behnel
- Terry Reedy
- Tim Wintle
- Xavier Morel
- Xavier Morel