[lxml-dev] SimpleXMLWriter vs. lxml performance
Hi.

On my blog I mentioned that I was building up a tree from the portal catalog and that ElementTree's SimpleXMLWriter was 2.5 times faster than doing it via lxml. Martijn asked me to post here the code I used, as it shouldn't have been such a large difference.

He was right -- sorry!!!. In the original comparison, for the lxml part I turned the pathindex string into subnodes for paths but didn't do this for the XMLWriter part.

In a 1:1 comparison, lxml is actually a bit faster. For those interested, my test is below, with the results in the docstring.

I'll post a correction in my blog post. (Note: I'm using the scoder2 branch.)

--Paul

"""
A simple speed comparison of SimpleXMLWriter versus lxml for DOM node creation.

Results for 50 entries:
0.0206758022308 ...XMLWriter average time
0.0190546989441 ...Etree average time

Results for 5000 entries:
1.80105571747 ...XMLWriter average time
0.999091792107 ...Etree average time
"""

from time import time
from lxml.etree import Element
import cStringIO
from elementtree.SimpleXMLWriter import XMLWriter

entries = 5000
entry = {
    "id": "1092309103910930",
    "creator": "automaticforthepeople",
    "title": "An entry in the portal_catalog",
    "description": "A longer textual description would go here",
    "created": "12/25/2005 00:00:00 GMT+1",
    "is_folderish": "1",
    "portal_type": "ATFolder",
    "review_state": "published",
    "path": "/Members/automaticforthepeople/junk",
}

def makeXMLWriter():
    f = cStringIO.StringIO()
    tree = XMLWriter(f)
    root = tree.start("catalog")
    for i in range(entries):
        tree.start("entry",
                   id=entry['id'],
                   creator=entry['creator'],
                   title=entry['title'],
                   description=entry['description'],
                   created=entry['created'],
                   is_folderish=entry['is_folderish'],
                   portal_type=entry['portal_type'],
                   review_state=entry['review_state'],
                   )
        tree.end()
    tree.close(root)

def makeEtree():
    root = Element("catalog")
    for i in range(entries):
        item = Element("entry")
        item.set("id", entry["id"])
        item.set("creator", entry["creator"])
        item.set("title", entry["title"])
        item.set("description", entry["description"])
        item.set("created", entry["created"])
        item.set("is_folderish", entry["is_folderish"])
        item.set("portal_type", entry["portal_type"])
        item.set("review_state", entry["review_state"])
        root.append(item)

def main():
    repeat = 10

    # Time first
    start1 = time()
    for i in range(repeat):
        makeXMLWriter()
    print (time() - start1)/repeat, "...XMLWriter average time"

    # Time second
    start2 = time()
    for i in range(repeat):
        makeEtree()
    print (time() - start2)/repeat, "...Etree average time"

if __name__ == "__main__":
    main()
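The extra work mentioned at the top of this message (turning the path string into subnodes, which only the lxml side did in the first comparison) might have looked roughly like the sketch below. The original code was not posted, so this is a guess: it is written for Python 3 against the standard library's xml.etree.ElementTree instead of lxml, and the `path` and `segment` element names and the `add_path_nodes` helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def add_path_nodes(item, path):
    """Append a <path> child holding one <segment> child per path component.

    Hypothetical helper: the element names are assumptions, not taken
    from the original benchmark.
    """
    path_node = ET.SubElement(item, "path")
    for part in path.strip("/").split("/"):
        ET.SubElement(path_node, "segment").text = part
    return path_node


item = ET.Element("entry", {"id": "1092309103910930"})
add_path_nodes(item, "/Members/automaticforthepeople/junk")
print(ET.tostring(item, encoding="unicode"))
```

Each entry then costs several additional element creations, which is the kind of overhead that would skew a comparison if only one side pays it.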
Paul Everitt wrote:
Hi. On my blog I mentioned that I was building up a tree from the portal catalog and that ElementTree's SimpleXMLWriter was 2.5 times faster than doing it via lxml. Martijn asked me to post here the code I used, as it shouldn't have been such a large difference.
He was right -- sorry!!!. In the original comparison, for the lxml part I turned the pathindex string into subnodes for paths but didn't do this for the XMLWriter part.
In a 1:1 comparison, lxml is actually a bit faster. For those interested, my test is below, with the results in the docstring.
Actually, you would expect lxml to be much faster, but looking through your code, I can imagine why it isn't.
I'll post a correction in my blog post. (Note: I'm using the scoder2 branch.)
Good choice! :)
"""
A simple speed comparison of SimpleXMLWriter versus lxml for DOM node creation.

Results for 50 entries:
0.0206758022308 ...XMLWriter average time
0.0190546989441 ...Etree average time

Results for 5000 entries:
1.80105571747 ...XMLWriter average time
0.999091792107 ...Etree average time
"""

[...]

def makeEtree():
    root = Element("catalog")
    for i in range(entries):
        item = Element("entry")
        item.set("id", entry["id"])
        item.set("creator", entry["creator"])
        item.set("title", entry["title"])
        item.set("description", entry["description"])
        item.set("created", entry["created"])
        item.set("is_folderish", entry["is_folderish"])
        item.set("portal_type", entry["portal_type"])
        item.set("review_state", entry["review_state"])
        root.append(item)
Two problems here: calls to item.set() are not for free and adding an Element to a tree is even more costly. Try using

--------------
def makeEtree():
    root = Element("catalog")
    for i in range(entries):
        SubElement(root, "entry", entry)
--------------

It should do the same, but I would expect it to be much, much faster. Maybe you could also try

--------------
def makeEtree():
    root = Element("catalog")
    for i in range(entries):
        SubElement(root, "entry", **entry)
--------------

and see if that makes a difference.

One more thing: could you avoid using tabs and the like? The Python code that left your mail program does not look very well formatted.

Stefan
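To check that the three spellings discussed above end up with the same attributes, here is a small sketch. It assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in, since its `Element`/`SubElement` signatures (an attribute dict as third argument, or keyword arguments) match the calls in this message; the shortened `entry` dict is only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the catalog entry used in the benchmark.
entry = {
    "id": "1092309103910930",
    "portal_type": "ATFolder",
    "path": "/Members/automaticforthepeople/junk",
}

# Variant 0: create the element, set attributes one by one, then append it.
root0 = ET.Element("catalog")
item = ET.Element("entry")
for key, value in entry.items():
    item.set(key, value)
root0.append(item)

# Variant 1: attributes passed as a dict (third positional argument).
root1 = ET.Element("catalog")
ET.SubElement(root1, "entry", entry)

# Variant 2: attributes passed as keyword arguments.
root2 = ET.Element("catalog")
ET.SubElement(root2, "entry", **entry)

# Collect the attribute mapping of the single child of each root.
attribs = [dict(root[0].attrib) for root in (root0, root1, root2)]
print(attribs[0] == attribs[1] == attribs[2])
```

One difference from the original makeEtree() is worth keeping in mind: passing the whole dict also sets the "path" attribute, which the explicit item.set() calls in the benchmark left out.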
Paul Everitt wrote:
A simple speed comparison of SimpleXMLWriter versus lxml for DOM node creation.

Results for 50 entries:
0.0206758022308 ...XMLWriter average time
0.0190546989441 ...Etree average time

Results for 5000 entries:
1.80105571747 ...XMLWriter average time
0.999091792107 ...Etree average time

Here is what I get for this code:

-----------------
def makeEtree0():
    root = Element("catalog")
    for i in range(entries):
        item = Element("entry")
        item.set("id", entry["id"])
        item.set("creator", entry["creator"])
        item.set("title", entry["title"])
        item.set("description", entry["description"])
        item.set("created", entry["created"])
        item.set("is_folderish", entry["is_folderish"])
        item.set("portal_type", entry["portal_type"])
        item.set("review_state", entry["review_state"])
        root.append(item)

def makeEtree1():
    root = Element("catalog")
    for i in range(entries):
        item = SubElement(root, "entry")
        item.set("id", entry["id"])
        item.set("creator", entry["creator"])
        item.set("title", entry["title"])
        item.set("description", entry["description"])
        item.set("created", entry["created"])
        item.set("is_folderish", entry["is_folderish"])
        item.set("portal_type", entry["portal_type"])
        item.set("review_state", entry["review_state"])

def makeEtree2():
    root = Element("catalog")
    for i in range(entries):
        SubElement(root, "entry", entry)
-----------------

2.18129448891 ...XMLWriter average time
1.30788469315 ...Etree0 (orig) average time
1.28693380356 ...Etree1 (SubElement) average time
0.97448201179 ...Etree2 (SubElement+dict) average time

Note that this is not a fair comparison to ElementTree's XMLWriter, since there may be better ways of using that API, too...

Stefan
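For anyone who wants to repeat the three-way comparison today, a rough Python 3 sketch follows. It is not the code that produced the numbers above: it uses the standard library's `timeit` and xml.etree.ElementTree (an assumption, so that it runs without lxml installed), and the function names and the trimmed `entry` dict are made up for the example. Swapping the import for `from lxml import etree as ET` should work as well, since the calls used are common to both.

```python
import timeit
import xml.etree.ElementTree as ET

ENTRIES = 5000
REPEAT = 10
entry = {
    "id": "1092309103910930",
    "creator": "automaticforthepeople",
    "portal_type": "ATFolder",
    "review_state": "published",
}


def make_set_append():
    """Element() + set() per attribute + append(), like makeEtree0."""
    root = ET.Element("catalog")
    for _ in range(ENTRIES):
        item = ET.Element("entry")
        for key, value in entry.items():
            item.set(key, value)
        root.append(item)
    return root


def make_subelement_set():
    """SubElement() + set() per attribute, like makeEtree1."""
    root = ET.Element("catalog")
    for _ in range(ENTRIES):
        item = ET.SubElement(root, "entry")
        for key, value in entry.items():
            item.set(key, value)
    return root


def make_subelement_dict():
    """SubElement() with the attribute dict, like makeEtree2."""
    root = ET.Element("catalog")
    for _ in range(ENTRIES):
        ET.SubElement(root, "entry", entry)
    return root


# Average wall-clock time per call, in seconds, for each variant.
results = {}
for func in (make_set_append, make_subelement_set, make_subelement_dict):
    results[func.__name__] = timeit.timeit(func, number=REPEAT) / REPEAT
    print("%-22s %.4f s" % (func.__name__, results[func.__name__]))
```

Absolute numbers will differ by machine and library version, so only the relative ordering is worth reading from a run.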
Thanks, Stefan, for the replies.

First, thanks for the style tip (and performance tip) for passing in a dict and getting attributes set. Obviously that's quite a simplification. :^)

I just ran the comparisons on OS X and got comparable differences. Also, for 10,000 entry elements, the memory stayed around 15 Mb during the tests.

Sorry about the tabs, BTW.

Also, thanks for the work on XSLT extension functions (and the other work as well). I hope to try out the extension functions during the coming days.

--Paul
Paul Everitt wrote:
Thanks, Stefan, for the replies. First, thanks for the style tip (and performance tip) for passing in a dict and getting attributes set. Obviously that's quite a simplification. :^)
I just ran the comparisons on OS X and got comparable differences. Also, for 10,000 entry elements, the memory stayed around 15 Mb during the tests.
For the entire Python interpreter, I assume? Sounds somewhat reasonable, given that the test actually generates 10 x 10,000 elements in total (10 repeats of 10,000 entries), including at least one Python object representation for each.
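A figure like this can also be approximated from inside Python. The sketch below assumes Python 3 and uses the standard library's `tracemalloc` with xml.etree.ElementTree; the `build_catalog` helper and the two-attribute dict are made up for the example. Note that `tracemalloc` only sees allocations made through Python's own allocator, so it is not the same measurement as the size of the whole interpreter process, and with lxml the memory that libxml2 allocates for the tree itself would, as far as I know, not show up at all.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def build_catalog(entries, attrib):
    """Build a <catalog> root with `entries` identical <entry> children."""
    root = ET.Element("catalog")
    for _ in range(entries):
        ET.SubElement(root, "entry", attrib)
    return root


tracemalloc.start()
root = build_catalog(10000, {"id": "1092309103910930", "portal_type": "ATFolder"})
# Both values are byte counts of traced Python allocations.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("current: %.1f MiB, peak: %.1f MiB" % (current / 2**20, peak / 2**20))
```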
Also, thanks for the work on XSLT extension functions (and the other work as well). I hope to try out the extension functions during the coming days.
Sure, go ahead. It's good to have people try them to see if they work as they should (and as people expect). Still, you may consider not yet betting your company on them. They are not yet in the trunk, so there may be modifications (also at the API level) before they appear in an lxml release.

I'm especially interested in what people think about the API provided by the Namespace class. I would personally prefer having that the primary API for extension functions/classes/elements.

Stefan
Stefan Behnel wrote: [snip]
I'm especially interested in what people think about the API provided by the Namespace class. I would personally prefer having that the primary API for extension functions/classes/elements.
Perhaps you could sketch out the use of this API in a doctest? This should make it easier for people to evaluate things. I've reviewed the way to hook in custom element classes and am okay with the principle, so a doctest for that wouldn't be a waste of time. :)

I'm curious to hear more about your thoughts about the API for extensibility -- how does it differ from the current XPath extension function for instance, and why it would be better.

Regards,

Martijn
Paul Everitt wrote:
Also, thanks for the work on XSLT extension functions (and the other work as well). I hope to try out the extension functions during the coming days.
Hi Stephan.

I just read through the namespace_extensions.txt you checked in on your branch. First, I admit, I didn't realize you were doing it this way. It's a lot more interesting and exciting than I thought. :^) I don't know if you ever saw XBL in Mozilla, but it reminds me a bit of those ideas.

My biggest question regards XSLT. You mention at the end that you can put a free-standing function in a namespace and call it from an XPath, presumably passing in arguments, etc. Will this also work from inside an XSLT processor?

Two other small questions on the element binding. Are constructors and attributes (class or instance) allowed?

Finally, a series of questions to understand the concepts involved. Can the element methods return an element, and thus be traversable? Imagine HonkElement having a method "honkstyles" that returned a nodeset of "honkstyle" entries. As XML text it would look like:

<honk>
  <honkstyles>
    <style>quack</style>
    <style>alarm</style>
  </honkstyles>
</honk>

...except the <honkstyles> and children didn't exist in the source XML. They are dynamically generated from Python, perhaps from a database, as you traversed it:

>>> honk_styles = honk_element.honkstyles
>>> print len(honk_styles)
2

Presumably, XPath wouldn't work in getting <honkstyles>, but once gotten, things would work:

>>> len(honk_element.xpath("honkstyles/style"))
0
>>> len(honk_styles.xpath("style"))
2
--Paul
Paul Everitt wrote:
Hi Stephan.
Uhum: -f- :)
I just read through the namespace_extensions.txt you checked in on your branch. First, I admit, I didn't realize you were doing it this way. It's a lot more interesting and exciting than I thought. :^)
That's why I implemented it. :)
My biggest question regards XSLT. You mention at the end that you can put a free-standing function in a namespace and call it from an XPath, presumably passing in arguments, etc. Will this also work from inside an XSLT processor?
Not all of them, just lxml. :) If you call the xslt processor from lxml (at least in the current scoder2 branch), it will use extension functions just as in XPath itself. It uses the same infrastructure behind the scenes.
Two other small questions on the element binding. Are constructors and attributes (class or instance) allowed?
No. Element instances are stateless, all state is represented in the underlying XML. So, when you call

>>> Element('honk', test="5")

it will return a HonkElement object. When you then look it up through some XPath call or tree traversal, it might return the same object or a new instance, if the old instance has been garbage collected.

You're right, I should have mentioned that. You cannot use a constructor as you do not know when it will be called from within the API. So, no state, no initialization apart from the underlying XML.

Thanks for asking questions, BTW, it makes me aware of what needs to be documented.
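The "no state apart from the underlying XML" rule can be illustrated with a toy model. This is not how lxml implements its element proxies; it is a plain Python 3 sketch over xml.etree.ElementTree, and the `HonkProxy` class and its `honking` attribute are invented for the example. The wrapper's `__init__` only stores the node reference, standing in for whatever the library does internally when it hands out an object.

```python
import xml.etree.ElementTree as ET


class HonkProxy:
    """Throwaway wrapper that keeps no data beyond the node reference."""

    __slots__ = ("_node",)

    def __init__(self, node):
        self._node = node

    @property
    def honking(self):
        # State is read from the XML attribute on every access.
        return self._node.get("honking") == "true"

    @honking.setter
    def honking(self, value):
        # State is written straight back into the XML.
        self._node.set("honking", "true" if value else "false")


root = ET.fromstring('<honk honking="false"/>')

first = HonkProxy(root)
first.honking = True
del first                  # the wrapper goes away...

second = HonkProxy(root)   # ...and a fresh wrapper sees the same state
print(second.honking)
```

Because everything lives in the tree, it does not matter whether a later lookup returns the old wrapper or a new one, which is why a constructor that sets up per-instance data has nothing reliable to hook into.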
Finally, a series of questions to understand the concepts involved. Can the element methods return an element, and thus be traversable? Imagine HonkElement having a method "honkstyles" that returned a nodeset of "honkstyle" entries. As XML text it would look like:
<honk>
  <honkstyles>
    <style>quack</style>
    <style>alarm</style>
  </honkstyles>
</honk>

...except the <honkstyles> and children didn't exist in the source XML. They are dynamically generated from Python, perhaps from a database, as you traversed it:

>>> honk_styles = honk_element.honkstyles
>>> print len(honk_styles)
2

Presumably, XPath wouldn't work in getting <honkstyles>, but once gotten, things would work:

>>> len(honk_element.xpath("honkstyles/style"))
0
>>> len(honk_styles.xpath("style"))
2
I'll reply to two things: doing this from elements and doing this from XPath functions.

Regarding elements: This works perfectly well. I already implemented something called an XPathModel, subclassing ElementBase, that is basically a meta-model based on XPath. You write XPath expressions into the class body and they are transformed into attributes and access/modifier methods at class creation time. I also have an autoconstructor that creates children at first attribute access. So, all you'd have to do for the above is to make honkstyles a property and create the necessary children on access.

I don't think that would be possible for XPath functions, though. You could possibly do that, but don't blame me if it doesn't work. What you'd have to do is create those elements from within the function. The problem is that I have no idea how libxml2 and lxml react if you modify the tree structure itself in an extension function, like adding new subtrees, for instance. You can try it if you like, would be interesting to know...

What you might be able to do, though, is creating new, *independent* elements and return those. They would not become part of the tree structure, but at least you could have them pop up from an XPath call. Then, again, that might segfault just as well.

So, to sum it up, these things are interesting, but they are not meant to work. Element classes, on the other hand, do not have that problem, as they are simple Python objects.

Stefan
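The "make honkstyles a property and create the necessary children on access" idea can be sketched as follows. This is only an approximation in Python 3: it subclasses the standard library's xml.etree.ElementTree `Element` directly, where the message above is about lxml's ElementBase and its class lookup, and the hard-coded style names stand in for data that would come from elsewhere (a database, say).

```python
import xml.etree.ElementTree as ET


class HonkElement(ET.Element):
    """Element whose <honkstyles> child is created on first access."""

    @property
    def honkstyles(self):
        styles = self.find("honkstyles")
        if styles is None:
            # First access: build the subtree. Placeholder data only.
            styles = ET.SubElement(self, "honkstyles")
            for name in ("quack", "alarm"):
                ET.SubElement(styles, "style").text = name
        return styles


honk_element = HonkElement("honk")

# Nothing exists in the tree before the property is touched.
before = len(honk_element.findall("honkstyles/style"))

honk_styles = honk_element.honkstyles

# After the first access the children are ordinary nodes.
after = len(honk_element.findall("honkstyles/style"))

print(before, len(honk_styles), after)
```

Unlike the session imagined in the quoted question, children created this way become real nodes, so a path lookup starting from the parent finds them once the property has been accessed.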
Participants (3):
- Martijn Faassen
- Paul Everitt
- Stefan Behnel