Mailman 3 [lxml-dev] SAX review (part 1) - lxml - The Python XML Toolkit

11 Nov 2005

      Stefan Behnel wrote:
...
I'd finally like to get my "bigfat.patch" a bit smaller. The main
additions that are still pending are these four:
* SAX I/O
Okay, starting to take a look at your SAX work now, giving my feedback, 
thoughts and questions as I go.

I understand that you implemented a ElementTree builder that consumes 
Python SAX events first, and then later on you created a Python-style 
SAX event producer on top of the ElementTree API. Very cool! I have 
quite a bit of SAX code lying around that may eventually benefit from this.

It'd be nice to eventually have a Python SAX producer on top of 
libxml2's SAX2 APIs directly as well.

First a few small Python style issues I noticed:

* I want to stick with a character limit per line of 80 characters. 
Longer lines should be broken up. If they can be broken 'naturally'
with a newline due to () and such that's preferable to the use of the 
backslash.

* I see you paired for instance startElementNS and endElementNS together 
without an empty line between them. I'd like to stick to empty lines 
between all method definitions.

* I prefer the use of the unicode literal u'' to a barebones unicode() 
call. I think it's a bit clearer to use the literal for this.

Now on to more details of the code, focusing on the SaxTreeBuilder class 
for now (I'll look at saxify later).

* I presume that due to the namespace reconciliation when adding 
elements repeating the ns_mapping each time you create an element is 
safe and won't result in a lot of namespace definitions in the 
serialized XML, correct?

* This code is *almost* compatible with ElementTree proper, except for 
the namespaces bit. For future potential reusability it might be nicer 
indeed to put it in a separate module instead of integrating it into 
etree.pyx.

* Hm, wouldn't characters() be a bit nicer like this? (untested)

parent = self._element_stack[-1]
try:
    # the text comes after another element, thus tail
    element = parent[-1]
    element.tail = (element.tail or u'') + data
except IndexError:
    # the text is before any element in the parent, thus text
    parent.text = (parent.text or u'') + data

* while this shouldn't matter normally, it might be nice to stick with 
the argument names for the content handler defined in the Python 
library, like name and qname.

* I prefer tuple unpacking of the 'tag_tuple' ('name'), like this. I 
also don't think it's necessary to have a template for such a simple 
thing as "{%s}%s"; repeating it a few times is actually less text than 
using the template... No need to make it a unicode string either, though 
it doesn't matter much in this case. (It would matter more for plain 
ElementTree -- this stores ascii strings where possible, so *not* 
forcing things into unicode might help it slightly, though I think SAX 
delivers unicode always so it wouldn't matter after all..). Also look at 
the bug I think I found (see comment).

ns_uri, localname = name

and then:

if ns_uri:
    el_name = '{%s}%s' % (ns_uri, localname)
elif self._default_ns:
    # order seems reversed in your code. Is this a bug or my
    # misunderstanding? Would be nice to have a test which
    # exercises this code!
    el_name = '{%s}%s' % (self._default_ns, localname)
else:
    el_name = localname

Anyway, going home now and not entirely done with the review yet. Hope 
this was useful! Lots of this was about minor matters of taste and style 
so far, but it's good to have a uniform style in lxml...

Regards,

Martijn

[lxml-dev] SAX review (part 1)

Martijn Faassen

Stefan Behnel

Stefan Behnel

Martijn Faassen

Stefan Behnel

Martijn Faassen

tags

participants (2)