[lxml-dev] SAX review (part 1)
Stefan Behnel wrote:
I'd finally like to get my "bigfat.patch" a bit smaller. The main additions that are still pending are these four:
* SAX I/O
Okay, starting to take a look at your SAX work now, giving my feedback, thoughts and questions as I go. I understand that you implemented a ElementTree builder that consumes Python SAX events first, and then later on you created a Python-style SAX event producer on top of the ElementTree API. Very cool! I have quite a bit of SAX code lying around that may eventually benefit from this. It'd be nice to eventually have a Python SAX producer on top of libxml2's SAX2 APIs directly as well. First a few small Python style issues I noticed: * I want to stick with a character limit per line of 80 characters. Longer lines should be broken up. If they can be broken 'naturally' with a newline due to () and such that's preferable to the use of the backslash. * I see you paired for instance startElementNS and endElementNS together without an empty line between them. I'd like to stick to empty lines between all method definitions. * I prefer the use of the unicode literal u'' to a barebones unicode() call. I think it's a bit clearer to use the literal for this. Now on to more details of the code, focusing on the SaxTreeBuilder class for now (I'll look at saxify later). * I presume that due to the namespace reconciliation when adding elements repeating the ns_mapping each time you create an element is safe and won't result in a lot of namespace definitions in the serialized XML, correct? * This code is *almost* compatible with ElementTree proper, except for the namespaces bit. For future potential reusability it might be nicer indeed to put it in a separate module instead of integrating it into etree.pyx. * Hm, wouldn't characters() be a bit nicer like this? (untested) parent = self._element_stack[-1] try: # the text comes after another element, thus tail element = parent[-1] element.tail = (element.tail or u'') + data except IndexError: # the text is before any element in the parent, thus text parent.text = (parent.text or u'') + data * while this shouldn't matter normally, it might be nice to stick with the argument names for the content handler defined in the Python library, like name and qname. * I prefer tuple unpacking of the 'tag_tuple' ('name'), like this. I also don't think it's necessary to have a template for such a simple thing as "{%s}%s"; repeating it a few times is actually less text than using the template... No need to make it a unicode string either, though it doesn't matter much in this case. (It would matter more for plain ElementTree -- this stores ascii strings where possible, so *not* forcing things into unicode might help it slightly, though I think SAX delivers unicode always so it wouldn't matter after all..). Also look at the bug I think I found (see comment). ns_uri, localname = name and then: if ns_uri: el_name = '{%s}%s' % (ns_uri, localname) elif self._default_ns: # order seems reversed in your code. Is this a bug or my # misunderstanding? Would be nice to have a test which # exercises this code! el_name = '{%s}%s' % (self._default_ns, localname) else: el_name = localname Anyway, going home now and not entirely done with the review yet. Hope this was useful! Lots of this was about minor matters of taste and style so far, but it's good to have a uniform style in lxml... Regards, Martijn
Martijn Faassen wrote:
Okay, starting to take a look at your SAX work now, giving my feedback, thoughts and questions as I go.
I understand that you implemented a ElementTree builder that consumes Python SAX events first, and then later on you created a Python-style SAX event producer on top of the ElementTree API. Very cool! I have quite a bit of SAX code lying around that may eventually benefit from this.
That's what it is for. You can use it to interface with any SAX capable Python API, i.e. basically every XML API for Python.
It'd be nice to eventually have a Python SAX producer on top of libxml2's SAX2 APIs directly as well.
Eventually, sure, but this implementation works *now*. It's much more work to implement this bug-free on top of libxml than on the public API. Related to that:
* This code is *almost* compatible with ElementTree proper, except for the namespaces bit. For future potential reusability it might be nicer indeed to put it in a separate module instead of integrating it into etree.pyx.
I had it outside at the beginning, but then I saw that ElemenTree also has its TreeBuilder inside the main module. And since this implementation may eventually also be ported to the libxml2 API (SAX is something that's supposed to be fast, see cocoon etc.), I prefered having it inside the Pyrex code.
First a few small Python style issues I noticed:
* I want to stick with a character limit per line of 80 characters. Longer lines should be broken up. If they can be broken 'naturally' with a newline due to () and such that's preferable to the use of the backslash.
Sure, I changed that.
* I see you paired for instance startElementNS and endElementNS together without an empty line between them. I'd like to stick to empty lines between all method definitions.
Ok, sure. I basically pared methods that actually do belong together. But since startElementNS is pretty long, it makes sense to separate them.
* I prefer the use of the unicode literal u'' to a barebones unicode() call. I think it's a bit clearer to use the literal for this.
That's a problem. I (obviously) tried that first and then found that Pyrex doesn't support unicode literals. So I had to take that tweak to get a unicode string. Same for the template. I used it to avoid having to write unicode("{%s}%s") % ... everywhere which would be unnecessarily inefficient.
Now on to more details of the code, focusing on the SaxTreeBuilder class for now (I'll look at saxify later).
* I presume that due to the namespace reconciliation when adding elements repeating the ns_mapping each time you create an element is safe and won't result in a lot of namespace definitions in the serialized XML, correct?
Um, well. I didn't see any superfluous NS definitions so far, so I guess that's just ok.
* Hm, wouldn't characters() be a bit nicer like this? (untested)
parent = self._element_stack[-1] try: # the text comes after another element, thus tail element = parent[-1] element.tail = (element.tail or u'') + data except IndexError: # the text is before any element in the parent, thus text parent.text = (parent.text or u'') + data
I think its quite readable like this: def characters(self, data): last_element = self._element_stack[-1] try: # if there already is a child element, we must append to its tail last_element = last_element[-1] last_element.tail = (last_element.tail or unicode()) + data except IndexError: # otherwise: append to the text last_element.text = (last_element.text or unicode()) + data
* while this shouldn't matter normally, it might be nice to stick with the argument names for the content handler defined in the Python library, like name and qname.
Sure.
* I prefer tuple unpacking of the 'tag_tuple' ('name'), like this. I also don't think it's necessary to have a template for such a simple thing as "{%s}%s"; repeating it a few times is actually less text than using the template... No need to make it a unicode string either, though it doesn't matter much in this case.
You're right, I didn't know Python did autoconversion in that direction, too: .>>> 'a%s' % u'ü' u'a\xfc' That simplifies things, guess I can remove the template then.
if ns_uri: el_name = '{%s}%s' % (ns_uri, localname) elif self._default_ns: # order seems reversed in your code. Is this a bug or my # misunderstanding? Would be nice to have a test which # exercises this code! el_name = '{%s}%s' % (self._default_ns, localname) else: el_name = localname
True. I'm actually surprised this didn't show up in the tests. Maybe I'll have to check my saxyfier there. I'll put these little changes back into my branch. Stefan
Stefan Behnel wrote:
Martijn Faassen wrote:
if ns_uri: el_name = '{%s}%s' % (ns_uri, localname) elif self._default_ns: # order seems reversed in your code. Is this a bug or my # misunderstanding? Would be nice to have a test which # exercises this code! el_name = '{%s}%s' % (self._default_ns, localname) else: el_name = localname
True. I'm actually surprised this didn't show up in the tests. Maybe I'll have to check my saxyfier there.
No, I'm actually not surprised. _recursive_saxify/_build_qname never produce a default (i.e. None) prefix since they always generate a prefix for each new namespace. I'd have to write a test that generates SAX events by hand to test that. blurgh... :) Stefan
Stefan Behnel wrote:
Martijn Faassen wrote: [snip]
It'd be nice to eventually have a Python SAX producer on top of libxml2's SAX2 APIs directly as well.
Eventually, sure, but this implementation works *now*. It's much more work to implement this bug-free on top of libxml than on the public API.
Yes, just to be clear here, I said it'd be nice eventually, not asking you to build it now. :) This is indeed very interesting already.
Related to that:
* This code is *almost* compatible with ElementTree proper, except for the namespaces bit. For future potential reusability it might be nicer indeed to put it in a separate module instead of integrating it into etree.pyx.
I had it outside at the beginning, but then I saw that ElemenTree also has its TreeBuilder inside the main module. And since this implementation may eventually also be ported to the libxml2 API (SAX is something that's supposed to be fast, see cocoon etc.), I prefered having it inside the Pyrex code.
I'd still prefer to have it as a separate .pyx file. If you want lxml's ElementTree functionality, you'd import etree, if you want lxml's SAX functionality, you'd import sax, and for DOM you'd import dom. Of course the sax module would build on etree, but you could imagine the DOM support has a mechanism to get from ElementTree to DOM as well... Eventually we might even have an Amara style bindery or whatnot. :)
First a few small Python style issues I noticed:
* I want to stick with a character limit per line of 80 characters. Longer lines should be broken up. If they can be broken 'naturally' with a newline due to () and such that's preferable to the use of the backslash.
Sure, I changed that.
Thanks!
* I see you paired for instance startElementNS and endElementNS together without an empty line between them. I'd like to stick to empty lines between all method definitions.
Ok, sure. I basically pared methods that actually do belong together. But since startElementNS is pretty long, it makes sense to separate them.
Right, I understood that and it's not necessarily a bad idea, but I just want to stick to the general principle of an empty line after each function across lxml.
* I prefer the use of the unicode literal u'' to a barebones unicode() call. I think it's a bit clearer to use the literal for this.
That's a problem. I (obviously) tried that first and then found that Pyrex doesn't support unicode literals. So I had to take that tweak to get a unicode string. Same for the template. I used it to avoid having to write unicode("{%s}%s") % ... everywhere which would be unnecessarily inefficient.
Makes sense, I didn't realize (or forgot) that unicode literals don't work in Pyrex.
Now on to more details of the code, focusing on the SaxTreeBuilder class for now (I'll look at saxify later).
* I presume that due to the namespace reconciliation when adding elements repeating the ns_mapping each time you create an element is safe and won't result in a lot of namespace definitions in the serialized XML, correct?
Um, well. I didn't see any superfluous NS definitions so far, so I guess that's just ok.
It could be useful to expand the unit tests to try a few basic cases. I haven't reviewed the tests yet so I'm just making the assumption that we don't have such a test now. :)
* Hm, wouldn't characters() be a bit nicer like this? (untested)
parent = self._element_stack[-1] try: # the text comes after another element, thus tail element = parent[-1] element.tail = (element.tail or u'') + data except IndexError: # the text is before any element in the parent, thus text parent.text = (parent.text or u'') + data
I think its quite readable like this:
def characters(self, data): last_element = self._element_stack[-1] try: # if there already is a child element, we must append to its tail last_element = last_element[-1] last_element.tail = (last_element.tail or unicode()) + data except IndexError: # otherwise: append to the text last_element.text = (last_element.text or unicode()) + data
Okay. I just wanted to be a bit more clear about the use of the word 'element' originally, but this looks fine.
* while this shouldn't matter normally, it might be nice to stick with the argument names for the content handler defined in the Python library, like name and qname.
Sure.
Thanks.
* I prefer tuple unpacking of the 'tag_tuple' ('name'), like this. I also don't think it's necessary to have a template for such a simple thing as "{%s}%s"; repeating it a few times is actually less text than using the template... No need to make it a unicode string either, though it doesn't matter much in this case.
You're right, I didn't know Python did autoconversion in that direction, too: .>>> 'a%s' % u'ü' u'a\xfc'
That simplifies things, guess I can remove the template then.
Okay, cool, are you going to use the tuple splitting too?
if ns_uri: el_name = '{%s}%s' % (ns_uri, localname) elif self._default_ns: # order seems reversed in your code. Is this a bug or my # misunderstanding? Would be nice to have a test which # exercises this code! el_name = '{%s}%s' % (self._default_ns, localname) else: el_name = localname
True. I'm actually surprised this didn't show up in the tests. Maybe I'll have to check my saxyfier there.
I'll put these little changes back into my branch.
Thanks! We should start thinking about how to go ahead and port the sax-based builder toward the trunk then. I'll do a review of the rest o the SAX code at first opportunity. Regards, Martijn
Martijn Faassen wrote:
Stefan Behnel wrote:
* I prefer tuple unpacking of the 'tag_tuple' ('name'), like this. I also don't think it's necessary to have a template for such a simple thing as "{%s}%s"; repeating it a few times is actually less text than using the template... No need to make it a unicode string either, though it doesn't matter much in this case.
That simplifies things, guess I can remove the template then.
Okay, cool, are you going to use the tuple splitting too?
Yes, but not everwhere. It only unnecessarily clutters the function with variables. I mean, how many different variable names can you imagine for "namespace_uri" and "tag_name"? And why would you want to use them all? Stefan
Stefan Behnel wrote:
Martijn Faassen wrote:
Stefan Behnel wrote:
* I prefer tuple unpacking of the 'tag_tuple' ('name'), like this. I also don't think it's necessary to have a template for such a simple thing as "{%s}%s"; repeating it a few times is actually less text than using the template... No need to make it a unicode string either, though it doesn't matter much in this case.
That simplifies things, guess I can remove the template then.
Okay, cool, are you going to use the tuple splitting too?
Yes, but not everwhere. It only unnecessarily clutters the function with variables. I mean, how many different variable names can you imagine for "namespace_uri" and "tag_name"? And why would you want to use them all?
I'm not sure what you mean by this, why would I want to imagine lots of variable names for this? I was talking about using explicit names instead of foo[0] and foo[1]. You split the tuple somewhere in the beginning of the function, right? namespace_uri, localname = .. or even directly in the function arguments (if Pyrex supports it); def startElementNS((namespace_uri, localname), ..): Regards, Martijn
participants (2)
-
Martijn Faassen
-
Stefan Behnel