[issue11379] Remove "lightweight" from minidom description
New submission from Stefan Behnel
Martin v. Löwis
Stefan Behnel
Martin v. Löwis
What about this phrasing then:
""" MiniDOM has a smaller memory footprint than some of the other DOM compliant implementations for Python (such as 4DOM), but uses about 10x more memory than the faster and simpler xml.etree.cElementTree module. """
But that's not a DOM implementation - so it would be comparing apples
and oranges.
----------
_______________________________________
Python tracker
Stefan Behnel
Martin v. Löwis
It's the tree based API most Python users are parsing XML with, though. So I do not agree that it's comparing apples and oranges, not at all. It's comparing tree based XML libraries, only one of which is worth being called "light weight", and that's not the one that is currently carrying that name.
If that is a real concern, I'd rather reduce the memory footprint of
minidom than put actual performance figures into the documentation
that will likely become outdated over time.
Notice that the documentation doesn't claim that it is a lightweight
XML library, only that it's a lightweight DOM implementation. SAX is,
of course, even lighter-weight.
----------
Stefan Behnel
If that is a real concern, I'd rather reduce the memory footprint of minidom than put actual performance figures into the documentation that will likely become outdated over time.
Personally, I do not think it's worth putting much work into MiniDOM. I'd rather deprecate it to prevent new code from being written for it, but that's just my personal opinion, and this is the wrong place to discuss that. Given the current performance characteristics, I wouldn't be surprised if there was quite a bit of room for improvement left in the xml.dom package. If you dislike the "10x", feel free to use "several times". I doubt that MiniDOM will ever get close enough to cET and lxml to prove that phrasing wrong.
Notice that the documentation doesn't claim that it is a lightweight XML library, only that it's a lightweight DOM implementation.
I imagine that you are as aware as I am that this nuance is easy to miss, especially for a new user. From my experience, it is very common for users, especially those with a Java-ish background, to confuse the terms "DOM" and "XML tree API/library". Hence my push to change the documentation.
SAX is, of course, even lighter-weight.
Not so much more light weight than cET's iterparse(), but that's getting OT here.
Stefan
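
For readers who have not used either API, the comparison between SAX and iterparse() can be sketched roughly as follows. This is only an illustration under assumptions: it uses xml.etree.ElementTree (which provides iterparse() in current Python versions) rather than cElementTree, and the sample document, the ItemCounter class, and the count_items_* function names are made up for this example.

```python
import io
import xml.sax
from xml.etree import ElementTree as ET

# Hypothetical sample document: 1000 small <item> elements under one root.
SAMPLE = b"<root>" + b"<item>x</item>" * 1000 + b"</root>"


def count_items_iterparse(data):
    """Count <item> elements incrementally with iterparse()."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "item":
            count += 1
            # clear() drops the element's text, attributes and children,
            # so the finished subtree does not accumulate in memory.
            elem.clear()
    return count


class ItemCounter(xml.sax.ContentHandler):
    """SAX handler that only counts <item> start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


def count_items_sax(data):
    """Count <item> elements with a SAX handler; no tree is built at all."""
    handler = ItemCounter()
    xml.sax.parseString(data, handler)
    return handler.count


if __name__ == "__main__":
    print(count_items_iterparse(SAMPLE), count_items_sax(SAMPLE))
```

Both functions process the document as a stream. The iterparse() version still hands over real element objects, which is why it is discarded explicitly after use.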
----------
Antoine Pitrou
Stefan Behnel
Éric Araujo
Stefan Behnel
Ezio Melotti
Fred L. Drake, Jr.
Stefan Behnel
Antoine Pitrou
I don't think "FUD" is a suitable term for the rather minidom-friendly wording in my last proposal. Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject.
If it's both slow and memory-hungry, perhaps use the more generic
"performance" instead of "memory footprint"?
----------
Ezio Melotti
Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject.
Do you have any link?
My point is that if you say things like "significantly/several times higher memory footprint than X" you are basically scaring the users away from the module. If for an average document it takes, say, 30-50MB of memory, it seems perfectly reasonable to me, even if ElementTree takes 3-5MB. I would actually consider 100-200MB still ok too, unless I have to parse a lot of documents or I'm running low on memory for other reasons.
----------
Antoine Pitrou
My point is that if you say things like "significantly/several times higher memory footprint than X" you are basically scaring the users away from the module.
Only those users who know they'll be processing significantly large documents. I don't think "scaring away people" is a good enough reason *not* to document performance characteristics. For example, we already mention that string joining is faster than repeated concatenation; I haven't heard anyone complain that it scared people away from string concatenation. And while it's true that we shouldn't try to document performance characteristics *too precisely*, it is still a good thing to document the most outstanding facts (for example, C accelerator modules are clearly superior in performance to pure Python modules; should we shy away from documenting that, and instead present it as some kind of neutral choice?). And, of course, if minidom gets some serious performance attention, the claims will have to be revisited. But given the amount of attention minidom gets at all, it sounds rather implausible.
If for an average document it takes, say, 30-50MB of memory, it seems perfectly reasonable to me, even if ElementTree takes 3-5MB. I would actually consider 100-200MB still ok too
Some use cases would not really like a 100-200MB memory consumption, or
even 50MB. Think a long-running daemon, for instance.
----------
Stefan Behnel
Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject.
Do you have any link?
I just did a quick Google search for "python minidom benchmark" and found
these:
http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Python/xml-m...
http://effbot.org/zone/celementtree.htm#benchmarks
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Note that all three authors risk being biased, but given how similar the
results are, I tend to believe them.
Stefan
----------
Antoine Pitrou
I just did a quick Google search for "python minidom benchmark" and found these:
http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Python/xml-m...
http://effbot.org/zone/celementtree.htm#benchmarks
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Note that all three authors risk being biased, but given how similar the results are, I tend to believe them.
Thanks for the links. The performance gap looks significant enough to be
mentioned, at least generically.
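
Anyone who wants to check the memory claims on their own data rather than rely on the linked benchmarks could start from a sketch like the one below. It is not taken from the thread: make_document and peak_memory are hypothetical helper names, the synthetic document is an assumption, and tracemalloc only sees allocations made through Python's memory allocators, so the figures are indicative at best.

```python
import tracemalloc
import xml.dom.minidom
from xml.etree import ElementTree as ET


def make_document(n_items=20000):
    """Build a synthetic XML document (hypothetical test data)."""
    body = "".join(
        '<item id="%d">value %d</item>' % (i, i) for i in range(n_items)
    )
    return "<root>%s</root>" % body


def peak_memory(parse, text):
    """Return the peak number of bytes traced while parse(text) runs."""
    tracemalloc.start()
    try:
        tree = parse(text)  # keep the tree alive while reading the counters
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    text = make_document()
    print("minidom peak bytes:    ", peak_memory(xml.dom.minidom.parseString, text))
    print("ElementTree peak bytes:", peak_memory(ET.fromstring, text))
```

The absolute numbers depend on the Python version and on the document, so they are better suited to a quick sanity check than to a figure quoted in documentation.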
----------
Stefan Behnel
Changes by Florent Xicluna
Changes by Florent Xicluna
Stefan Behnel
Stefan Behnel
Ezio Melotti
xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface.
This is ok.
It is intended to be simpler than the full DOM and also [+XXX: provide a] significantly smaller [+XXX: API].
Doesn't "simpler" here refer to the API already?
Another option is to add somewhere a section like:
"If you have to work with XML, ElementTree is usually the best choice, because it has a simple API and it's efficient [or whatever]. xml.dom.minidom provides a subset of the W3C-DOM API, and xml.sax a SAX interface.", possibly expanding a bit on the differences and showing a minimal example with the 3 different implementations, and then link to it from the other modules' pages.
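
A minimal example of the kind suggested above might look like the following sketch, which extracts the same attribute values with ElementTree, minidom, and SAX. The document, the titles_* function names, and the TitleHandler class are invented for illustration; this is not the example that ended up in the documentation.

```python
import xml.dom.minidom
import xml.sax
from xml.etree import ElementTree as ET

# Hypothetical input document.
DOC = "<books><book title='A'/><book title='B'/></books>"


def titles_etree(text):
    """ElementTree: elements behave like lightweight Python containers."""
    root = ET.fromstring(text)
    return [book.get("title") for book in root.iter("book")]


def titles_minidom(text):
    """minidom: a subset of the W3C DOM API (nodes, getElementsByTagName)."""
    dom = xml.dom.minidom.parseString(text)
    try:
        return [
            node.getAttribute("title")
            for node in dom.getElementsByTagName("book")
        ]
    finally:
        dom.unlink()  # break internal reference cycles once we are done


class TitleHandler(xml.sax.ContentHandler):
    """SAX: event callbacks, no tree is kept in memory."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "book":
            self.titles.append(attrs.getValue("title"))


def titles_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles


if __name__ == "__main__":
    print(titles_etree(DOC), titles_minidom(DOC), titles_sax(DOC))
```

All three return the same list. The difference lies in how much API surface and state each approach requires for the same task.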
----------
Martin v. Löwis
"If you have to work with XML, ElementTree is usually the best choice, because it has a simple API and it's efficient [or whatever].
I still object to such wording, for many reasons.
----------
Eli Bendersky
Changes by Tshepang Lekhonkhobe
Éric Araujo
Eli Bendersky
Changes by Ezio Melotti
Martin v. Löwis
Eli Bendersky
Éric Araujo
Eli Bendersky
I’m not sure I would use note markup, though (cf. Raymond’s aversion to littering the doc with note and warning boxes).
I also dislike box littering, but this one seems like a really good fit for
a note, since it's completely outside the flow of that documentation page.
----------
Raymond Hettinger
Roundup Robot
Roundup Robot
Eli Bendersky
Stefan Behnel
Eli Bendersky
Éric Araujo
Antoine Pitrou
I think I’ve always understood “lightweight” to mean “minimal”.
Then how about saying "minimal" instead of "lightweight"?
(also, it seems it really means "incomplete" or "partial", which are of course less positive sounding)
----------
Ezio Melotti
Éric Araujo
Éric Araujo
Stefan Behnel
Éric Araujo
Roundup Robot
Roundup Robot
Éric Araujo
Martin v. Löwis
FYI, note that http://wiki.python.org/moin/MiniDom says this about minidom: “slow and very memory hungry DOM implementation”.
Thanks for the notice; I have now fixed that wording.
----------
Eli Bendersky
Stefan Behnel
Eli Bendersky
Éric Araujo
Ezio Melotti
Éric Araujo
Changes by Eli Bendersky
Stefan Behnel added the comment:
Any news on this?
----------
Stefan Behnel added the comment:
I'm not sure if it's a good idea to keep bikeshedding about this for another two years. Personally, I would prefer having someone with commit rights fix this and be done with it.
Eric's last patch looks ok and parts of it went in already, so it's mostly just the heading that remains to be fixed.
----------
versions: +Python 3.4
Antoine Pitrou added the comment:
Someone should go ahead and apply this. Éric, perhaps?
----------
stage: needs patch -> commit review
Éric Araujo added the comment:
Sure, feel free to commit this.
----------
Roundup Robot added the comment:
New changeset c2ae1ed03853 by Ezio Melotti in branch '2.7':
#11379: rephrase minidom documentation to use the term "minimal" instead of "lightweight". Patch by Éric Araujo.
http://hg.python.org/cpython/rev/c2ae1ed03853
New changeset b9c0e050c935 by Ezio Melotti in branch '3.2':
#11379: rephrase minidom documentation to use the term "minimal" instead of "lightweight". Patch by Éric Araujo.
http://hg.python.org/cpython/rev/b9c0e050c935
New changeset 8ff512910338 by Ezio Melotti in branch '3.3':
#11379: merge with 3.2.
http://hg.python.org/cpython/rev/8ff512910338
New changeset 9a0cd5363c2a by Ezio Melotti in branch 'default':
#11379: merge with 3.3.
http://hg.python.org/cpython/rev/9a0cd5363c2a
----------
Ezio Melotti added the comment:
Fixed, thanks for the patch!
----------
assignee: docs@python -> ezio.melotti
resolution: -> fixed
stage: commit review -> committed/rejected
status: open -> closed
type: performance -> enhancement
Roundup Robot added the comment:
New changeset 39ea24aaf0e7 by Antoine Pitrou in branch '2.7':
s/lightweight/minimal/, as per issue #11379.
http://hg.python.org/cpython/rev/39ea24aaf0e7
New changeset b63258b6eb4d by Antoine Pitrou in branch '3.3':
s/lightweight/minimal/, as per issue #11379.
http://hg.python.org/cpython/rev/b63258b6eb4d
New changeset d659e7761d59 by Antoine Pitrou in branch 'default':
s/lightweight/minimal/, as per issue #11379.
http://hg.python.org/cpython/rev/d659e7761d59
----------
participants (12)
- Antoine Pitrou
- Eli Bendersky
- Ezio Melotti
- Florent Xicluna
- Fred L. Drake, Jr.
- Martin v. Löwis
- Raymond Hettinger
- Roundup Robot
- Senthil Kumaran
- Stefan Behnel
- Tshepang Lekhonkhobe
- Éric Araujo