[lxml-dev] lxml branch

Hi Martijn, list I have checked code in a branch that allow to use a different parser than the default globally shared. This allows to setup different options than the default. All previous tests (321 ?) do pass. I'd be pleased to get feedback about the code before working on the supplementary tests needed. I am not totally sure I went the right way even if it works for me... Cheers -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
Hey, interesting. What was the concrete use case you wanted to tackle?
My main question is how does this tie into previous discussions and patches about parser options with Geert Jansen and Patrick Wagstrom? We now seem to have three different approaches which work in related areas, so I'm trying to figure out what to take from each. Stefan Behnel to the rescue, I hope! :) Regards, Martijn

Martijn Faassen wrote:
Godefroid Chapelle wrote:
I have checked code in a branch
... called 'gotcha-parser-setup', I assume ...
Good question. From what I understand, the patch introduces a parser choice at the ElementTree level that is propagated to XSLT parsing. I don't currently understand the reason for this.
It doesn't look like a conflicting submission. It seems to tackle a different problem: having multiple, differently configured parsers around. Still, at first glance, the propagation doesn't look very clean. We would have to do something like this everywhere, but I'd first like to hear about a use case. Stefan

Stefan Behnel wrote:
rigth
that is propagated to XSLT parsing.
I don't currently understand the reason for this.
propagation to xslt is needed both for compilation and for tests to pass
Yes, this is what I tried to do : avoid to setup options at a global level...
Still, at first glance, the propagation doesn't look very clean. We would have to do something like this everywhere, but I'd first like to hear about a use case.
... but global scope for options might not be a problem : I do not know enough about the PARSE options consequences.
Stefan
-- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
Stefan Behnel wrote:
<snip>
Can someone tell me if global PARSE options could be a problem or not ? Thanks -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
Although I'm not completely sure I understand what you are asking, you may want to look at the other half of this thread. We are just trying to specify a good way of implementing a new interface to the parser configuration. And no, it does not require you to use any global options. My last response to this is here: http://codespeak.net/pipermail/lxml-dev/2005-November/000763.html You are very welcome to comment on the current state of ideas. Stefan

Martijn Faassen wrote:
I needed to be able to parse xhtml with its entities. This implied being able to set the XML_PARSE_DTDLOAD on the parser.
Sorry, I did not find those threads... can you point them to me ?
Regards,
Martijn
-- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
I see. But that can already be done with Geert's patch. No need to propagate parsers for that.
No, but I can tell you where you can find them :) http://codespeak.net/pipermail/lxml-dev/2005-November/thread.html Stefan

Martijn Faassen wrote:
fwiw, if you want to be as ElementTree-compatible as you possibly can, I recommend the following approach: class XMLParser: def __init__(self, **options): self.options = options def parse(source, parser=None): ... if isinstance(parser, XMLParser): p = create parser, using parser.options for configuration else: p = create standard parser this lets the user use the standard doc = ET.parse(source, parser=XMLParser(configuration)) pattern to parse documents using non-default configurations. for extra compatibility points, you can add XMLParser-style target handling to the wrapper: class XMLParser: def __init__(self, ..., target=None, **options): self.target = target self.options = options if target: # set things up for SAX-style parsing; the event handlers # should call target.start, target.data, target.end etc. self.saxparser = ... def feed(self, data): self.saxparser.feed(data) def close(self): self.saxparser.done() self.saxparser = None # release resources </F>

Fredrik Lundh schrieb:
Funny idea to have the XMLParser class be completely empty except for dictionary storing. I guess the best would then be to let it inherit from dict and do nothing at all. :) This would actually make Geert's interface a bit nicer since we no longer need to supply all keyword arguments in all places where parsing happens. Having a single classes that stores them is also good for later extensions. I like this, it looks pretty clean to me.
Hmm, I guess that one's currently a bit too far in the future, since we don't have a SAX parser that uses libxml2's native SAX capabilities. Stefan

Stefan Behnel wrote:
it's the usage pattern that matters, not what's inside the class definition. but it's probably a good idea to explicitly spell out the available options in the constructor, e.g. class XMLParser: def __init__(self, a=..., b=..., c=...): ... validate option values ... self.options = ... depending on how libxml2 is designed, it may also be a good idea to move the actual parser creation into the parser wrapper: class XMLParser: def __init__(self, a=..., b=..., c=...): ... validate option values ... self.options = ... def _makeparser(self): ... construct appropriate parser based on options ... return parser </F>

Fredrik Lundh wrote:
Just wanted to highlight this; agreed with Fredrik here. How things get stored is up to the implementor of the class. The **options is a nice trick, but having explicit options is better where possible.
This is something we should consider. Godefroid, Stefan, Geert, opinions? Thanks again for the input, Fredrik! Much appreciated. Regards, Martijn

Martijn Faassen wrote:
Sure, makes things readable and helps people using eric and the like to get better online help.
Following Geert's patch, I think this would rather become something like class XMLParser: def __init__(self, parse_entities=False, ...): self.parse_entities = parse_entities ... cdef int _parseoptions(self): cdef int options options = 0 if not self.parse_entities: options |= xmlparser.XML_PARSE_NOENT ... return options instead of instantiating the actual parser in that class. Building the parse options is what matters here. I used 'cdef' to make this C-private since users should not be bothered with the internals of libxml2 options. Stefan

Stefan Behnel wrote:
I wonder if it would make sense to simplify the interface towards the SAX parser, too. Something like this: class SAXParser: def __init__(self): pass def _makeparser(self, sax_input): from sax import ElementTreeContentHandler return ElementTreeContentHandler(sax_input) And then we could extend the parse() functions in etree to accept both. This would even allow us to use the same interface for parsing from strings: class XMLStringParser(XMLParser): pass Users could then throw either a SAX source, a filename, a file-like object or a string into parse(), without a major difference in usage, just by selecting/configuring an appropriate parser. We would just have to check the type of the parser input object. Maybe having the common function _makeparser in them would still be a good idea then, to simplify the type checking. What do you think? Stefan

Fredrik Lundh wrote:
I like this pattern of usage... I did not dare to do it this way in my branch since I do not know if the parser interferes with the representation it produces. For instance, __copy__ creates a new document (parser.newDoc()). Do the parser options need to be the same for the copy than the options of the parser for the original document ? Why is the parser needed for a copy ? I do not understand all the ramifications of the code. BTW, where can I find Geert's patch; I did not manage to find it myself. <snip> -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
BTW, where can I find Geert's patch; I did not manage to find it myself.
The latest version I've seen was here: http://codespeak.net/pipermail/lxml-dev/2005-November/000684.html Stefan

Stefan Behnel wrote:
I finally found time to look at the patch. However, I am not used to read diff files and I did not understand how the parse options are supposed to be setup. Geert, can you explain ? Thanks -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
Hey, interesting. What was the concrete use case you wanted to tackle?
My main question is how does this tie into previous discussions and patches about parser options with Geert Jansen and Patrick Wagstrom? We now seem to have three different approaches which work in related areas, so I'm trying to figure out what to take from each. Stefan Behnel to the rescue, I hope! :) Regards, Martijn

Martijn Faassen wrote:
Godefroid Chapelle wrote:
I have checked code in a branch
... called 'gotcha-parser-setup', I assume ...
Good question. From what I understand, the patch introduces a parser choice at the ElementTree level that is propagated to XSLT parsing. I don't currently understand the reason for this.
It doesn't look like a conflicting submission. It seems to tackle a different problem: having multiple, differently configured parsers around. Still, at first glance, the propagation doesn't look very clean. We would have to do something like this everywhere, but I'd first like to hear about a use case. Stefan

Stefan Behnel wrote:
rigth
that is propagated to XSLT parsing.
I don't currently understand the reason for this.
propagation to xslt is needed both for compilation and for tests to pass
Yes, this is what I tried to do : avoid to setup options at a global level...
Still, at first glance, the propagation doesn't look very clean. We would have to do something like this everywhere, but I'd first like to hear about a use case.
... but global scope for options might not be a problem : I do not know enough about the PARSE options consequences.
Stefan
-- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
Stefan Behnel wrote:
<snip>
Can someone tell me if global PARSE options could be a problem or not ? Thanks -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
Although I'm not completely sure I understand what you are asking, you may want to look at the other half of this thread. We are just trying to specify a good way of implementing a new interface to the parser configuration. And no, it does not require you to use any global options. My last response to this is here: http://codespeak.net/pipermail/lxml-dev/2005-November/000763.html You are very welcome to comment on the current state of ideas. Stefan

Martijn Faassen wrote:
I needed to be able to parse xhtml with its entities. This implied being able to set the XML_PARSE_DTDLOAD on the parser.
Sorry, I did not find those threads... can you point them to me ?
Regards,
Martijn
-- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
I see. But that can already be done with Geert's patch. No need to propagate parsers for that.
No, but I can tell you where you can find them :) http://codespeak.net/pipermail/lxml-dev/2005-November/thread.html Stefan

Martijn Faassen wrote:
fwiw, if you want to be as ElementTree-compatible as you possibly can, I recommend the following approach: class XMLParser: def __init__(self, **options): self.options = options def parse(source, parser=None): ... if isinstance(parser, XMLParser): p = create parser, using parser.options for configuration else: p = create standard parser this lets the user use the standard doc = ET.parse(source, parser=XMLParser(configuration)) pattern to parse documents using non-default configurations. for extra compatibility points, you can add XMLParser-style target handling to the wrapper: class XMLParser: def __init__(self, ..., target=None, **options): self.target = target self.options = options if target: # set things up for SAX-style parsing; the event handlers # should call target.start, target.data, target.end etc. self.saxparser = ... def feed(self, data): self.saxparser.feed(data) def close(self): self.saxparser.done() self.saxparser = None # release resources </F>

Fredrik Lundh schrieb:
Funny idea to have the XMLParser class be completely empty except for dictionary storing. I guess the best would then be to let it inherit from dict and do nothing at all. :) This would actually make Geert's interface a bit nicer since we no longer need to supply all keyword arguments in all places where parsing happens. Having a single classes that stores them is also good for later extensions. I like this, it looks pretty clean to me.
Hmm, I guess that one's currently a bit too far in the future, since we don't have a SAX parser that uses libxml2's native SAX capabilities. Stefan

Stefan Behnel wrote:
it's the usage pattern that matters, not what's inside the class definition. but it's probably a good idea to explicitly spell out the available options in the constructor, e.g. class XMLParser: def __init__(self, a=..., b=..., c=...): ... validate option values ... self.options = ... depending on how libxml2 is designed, it may also be a good idea to move the actual parser creation into the parser wrapper: class XMLParser: def __init__(self, a=..., b=..., c=...): ... validate option values ... self.options = ... def _makeparser(self): ... construct appropriate parser based on options ... return parser </F>

Fredrik Lundh wrote:
Just wanted to highlight this; agreed with Fredrik here. How things get stored is up to the implementor of the class. The **options is a nice trick, but having explicit options is better where possible.
This is something we should consider. Godefroid, Stefan, Geert, opinions? Thanks again for the input, Fredrik! Much appreciated. Regards, Martijn

Martijn Faassen wrote:
Sure, makes things readable and helps people using eric and the like to get better online help.
Following Geert's patch, I think this would rather become something like class XMLParser: def __init__(self, parse_entities=False, ...): self.parse_entities = parse_entities ... cdef int _parseoptions(self): cdef int options options = 0 if not self.parse_entities: options |= xmlparser.XML_PARSE_NOENT ... return options instead of instantiating the actual parser in that class. Building the parse options is what matters here. I used 'cdef' to make this C-private since users should not be bothered with the internals of libxml2 options. Stefan

Stefan Behnel wrote:
I wonder if it would make sense to simplify the interface towards the SAX parser, too. Something like this: class SAXParser: def __init__(self): pass def _makeparser(self, sax_input): from sax import ElementTreeContentHandler return ElementTreeContentHandler(sax_input) And then we could extend the parse() functions in etree to accept both. This would even allow us to use the same interface for parsing from strings: class XMLStringParser(XMLParser): pass Users could then throw either a SAX source, a filename, a file-like object or a string into parse(), without a major difference in usage, just by selecting/configuring an appropriate parser. We would just have to check the type of the parser input object. Maybe having the common function _makeparser in them would still be a good idea then, to simplify the type checking. What do you think? Stefan

Fredrik Lundh wrote:
I like this pattern of usage... I did not dare to do it this way in my branch since I do not know if the parser interferes with the representation it produces. For instance, __copy__ creates a new document (parser.newDoc()). Do the parser options need to be the same for the copy than the options of the parser for the original document ? Why is the parser needed for a copy ? I do not understand all the ramifications of the code. BTW, where can I find Geert's patch; I did not manage to find it myself. <snip> -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be

Godefroid Chapelle wrote:
BTW, where can I find Geert's patch; I did not manage to find it myself.
The latest version I've seen was here: http://codespeak.net/pipermail/lxml-dev/2005-November/000684.html Stefan

Stefan Behnel wrote:
I finally found time to look at the patch. However, I am not used to read diff files and I did not understand how the parse options are supposed to be setup. Geert, can you explain ? Thanks -- Godefroid Chapelle (aka __gotcha)- BubbleNet http://bubblenet.be
participants (4)
-
Fredrik Lundh
-
Godefroid Chapelle
-
Martijn Faassen
-
Stefan Behnel