[lxml-dev] lxml & parsing: return of a classes

So I was thinking a little about how we could allow easy customization of the URL getter, since we can't attach it to the tree or any element. And then generally how any customization could be done, for instance if you want a new method on all elements. This isn't that easy currently. You'd have to subclass a bunch of classes and rewrite a bunch of functions. But I think if we move all parsing to a single class it would help a great deal. The idea is something like:

    class Parser(object):
        _etree_parser_class = etree.HTMLParser

        def __init__(self):
            self._etree_parser = self._etree_parser_class()
            self._etree_parser.setElementClassLookup(self)

        def __call__(self, filename, **kw):
            return etree.parse(filename, self._etree_parser, **kw)

        def fromstring(...):
            ...

And so forth. Then either expose this via:

    parse = Parser()

Or perhaps:

    _parser = Parser()
    parse = _parser
    fromstring = _parser.fromstring

And so forth. If you want to adjust something, you don't have to reimplement all the forms of parsers, since they all would just use self, and are mostly defined in terms of each other. We could support subclassing with something like this:

    class Parser(object):
        _element_classes = {}
        _element_mixins = {}

        def __init__(self):
            self._element_classes = self._element_classes.copy()
            mixers = {}
            for name, value in self._element_mixins.items():
                if name == '*':
                    for n in self._element_classes.keys():
                        mixers.setdefault(n, []).append(value)
                else:
                    mixers.setdefault(name, []).append(value)
            for name, mixins in mixers.items():
                cur = self._element_classes.get(name, HtmlElement)
                bases = mixins + [cur]
                new_class = type(cur.__name__, tuple(bases), {})
                self._element_classes[name] = new_class

    class MyMixin(object):
        # extra methods
        pass

    class FormMixin(object):
        # other methods for the form element
        pass

    class ParserMixedIn(Parser):
        _element_mixins = {'*': MyMixin, 'form': FormMixin}

And then it would be really easy to create local extensions for all HTML elements, or particular elements.
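The mixin idea can be tried without lxml. What follows is only a minimal, self-contained sketch: HtmlElement and FormElement are plain stand-in classes (not the lxml.html ones), and describe() and field_names() are hypothetical methods invented for illustration.

```python
class HtmlElement(object):
    """Stand-in for the generic element class (not the real lxml class)."""


class FormElement(HtmlElement):
    """Stand-in for a specialised <form> element class."""


class Parser(object):
    # Tag name -> element class; subclasses may override both tables.
    _element_classes = {'form': FormElement, 'p': HtmlElement}
    _element_mixins = {}

    def __init__(self):
        # Copy so that per-instance changes never touch the class-level dict.
        self._element_classes = dict(self._element_classes)
        mixers = {}
        for name, mixin in self._element_mixins.items():
            # '*' applies the mixin to every known element class.
            targets = list(self._element_classes) if name == '*' else [name]
            for target in targets:
                mixers.setdefault(target, []).append(mixin)
        for name, mixins in mixers.items():
            cur = self._element_classes.get(name, HtmlElement)
            # Mixins come first so their methods win in the MRO.
            bases = tuple(mixins) + (cur,)
            self._element_classes[name] = type(cur.__name__, bases, {})


class MyMixin(object):
    def describe(self):  # hypothetical extra method for all elements
        return type(self).__name__


class FormMixin(object):
    def field_names(self):  # hypothetical extra method for forms only
        return []


class ParserMixedIn(Parser):
    _element_mixins = {'*': MyMixin, 'form': FormMixin}


parser = ParserMixedIn()
form_cls = parser._element_classes['form']
p_cls = parser._element_classes['p']
```

Here form_cls inherits from both mixins and from FormElement, p_cls only picks up MyMixin, and the class-level table on Parser is left untouched.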
I'm not sure exactly how to attach the URL getting method to the Parser object in this model, because I'm not sure how to give elements a reference back to it. We could do it with class variables, but then the parser would *have* to subclass every element every time it was instantiated, so it could make new classes with a reference back to itself. But maybe there's a better way. Do the elements already have a reference back to that etree.HTMLParser() instance, and could we attach this to that instance? Or perhaps extend HTMLParser directly instead of having this other parser class?

-- 
Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers
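To make the class-variable option concrete, here is a rough sketch under stated assumptions: the element classes are plain stand-ins rather than lxml classes, and open_url() is a hypothetical name for the "URL getter". It shows why every parser instance ends up creating its own set of element subclasses.

```python
class HtmlElement(object):
    """Stand-in element class; _parser is filled in per parser instance."""
    _parser = None

    def open_url(self, url):
        # Delegate to whichever parser created this element's class.
        return self._parser.open_url(url)


class Parser(object):
    _element_classes = {'a': HtmlElement, 'form': HtmlElement}

    def __init__(self):
        # One fresh subclass per tag and per parser instance, each carrying
        # a class-level reference back to this particular parser.
        self._element_classes = {
            name: type(cls.__name__, (cls,), {'_parser': self})
            for name, cls in self._element_classes.items()
        }

    def open_url(self, url):  # hypothetical default URL getter
        return 'default:' + url


class LoggingParser(Parser):
    def __init__(self):
        super().__init__()
        self.requested = []

    def open_url(self, url):
        self.requested.append(url)
        return 'logged:' + url


plain = Parser()
logging_parser = LoggingParser()
plain_link = plain._element_classes['a']()
logged_link = logging_parser._element_classes['a']()
```

Each parser owns distinct classes, so plain_link.open_url() and logged_link.open_url() reach different getters. The cost is the one described above: a new batch of classes on every instantiation.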

Hi Ian,

Ian Bicking wrote:
That's a good idea, but as you suggest at the end, extending the HTMLParser class directly is the way to go. Documents in lxml.etree keep a reference to their parser to support inheritance of resolvers. It's even readable from Python as "parser" property of an ElementTree. That would nicely solve most of your problems.
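A small sketch of what that could look like, assuming lxml is installed. MyHTMLParser and its open_url() method are hypothetical names for illustration; the "parser" property and getroottree() are the existing lxml.etree API.

```python
from io import BytesIO

from lxml import etree


class MyHTMLParser(etree.HTMLParser):
    """HTMLParser subclass carrying a hypothetical URL-opening hook."""

    def open_url(self, url):
        # Placeholder behaviour; a real hook would perform the request.
        return 'opened:' + url


parser = MyHTMLParser()
source = BytesIO(b"<html><body><form action='/go'></form></body></html>")
tree = etree.parse(source, parser)

# The tree remembers the parser that built it ...
tree_parser = tree.parser

# ... and an element can reach it through its root tree.
form = tree.getroot().find('.//form')
form_parser = form.getroottree().parser
result = form_parser.open_url(form.get('action'))
```

With this arrangement the customization lives in one place, the parser subclass, and elements need no per-parser subclasses to find it.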
I would have to see how this looks if you inherit from HTMLParser and how this matches with the existing class lookup mechanisms.
I think we should try to integrate with the normal Resolver mechanism here (doc/resolvers.txt). Not sure how this works exactly if we want to use it from Python code (currently it's only called from libxml2 internally), but I would like to avoid adding yet another way to resolve URLs.

Currently, resolvers receive an opaque "context" object as last argument and return an opaque object with a string or file-like object etc. We could easily replace the context with an object containing a sequence of form arguments (which would be None when calling from libxml2).

Stefan

Stefan Behnel wrote:
Ok, this is basically how resolvers work currently: They have a resolve() method that takes a system URL and a public ID (as in a DTD DOCTYPE), as well as an opaque "context" object. They return another opaque object of type _InputDocument, that is created by calling one of the resolve_*() methods in the _Resolver base type. It is evaluated internally to read from the source (string, file, ...) that was passed to resolve_*().

I could imagine making the parse() function aware of _InputDocument as input type, so that you could subclass _Resolver in your use case, call its resolve() method directly from the submit() method and return the result so that the user can pass the return value to parse() and have it read the result into a tree. This would allow "parse( form.submit() )" to work with the existing resolver infrastructure.

Current problems:

- resolvers do not support URL options (?x=y&a=b). As described above, this would have to be passed through the context object somehow.

- this would work with parse(), but there's also the case where the result is not XML or HTML. We would need a different API to retrieve the result as a bare string.

So, this would work, but it's far from clean currently. We should put some more thoughts into this.

Stefan
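For reference, the resolver flow described above looks roughly like this through the public etree.Resolver class. This is a sketch assuming lxml is installed; RecordingResolver, seen_urls and the DTD content are made up for the example. It only covers the existing DTD use case, not the form-submission extension under discussion.

```python
from io import BytesIO

from lxml import etree

seen_urls = []  # system URLs the resolver was asked for


class RecordingResolver(etree.Resolver):
    def resolve(self, system_url, public_id, context):
        # Record the request, then answer it from a string instead of
        # the file system or network. The context is passed through as-is.
        seen_urls.append(system_url)
        return self.resolve_string('<!ELEMENT doc EMPTY>', context)


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(RecordingResolver())

xml = b'<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc/>'
tree = etree.parse(BytesIO(xml), parser)
root_tag = tree.getroot().tag
```

The return value of resolve_string() is the opaque input-document object; the caller never inspects it, which is why returning a bare string to user code would need a different API.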

Stefan Behnel wrote:
I thought about this some more and I now think that it would be inappropriate to use etree's resolver interface here. It serves a totally different purpose and the extension with form data would be useless everywhere else. A simple function passed as argument to submit would do. If you want a set-once-use-everywhere setup, the parser is the only place where this would work, although I dislike the idea of using the /parser/ to set a method for /submitting/ form data. Especially if you have to parse the result yourself.

I now made the submit() method a "submit_form()" module function. That way, you can easily write your own function with the same interface that simply passes the appropriate HTTP mechanism in for you. The signature is:

    def submit_form(form, extra_values=None, open_http=None)

Stefan
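A wrapper of the kind suggested might look like the sketch below. It assumes lxml is installed and that open_http is called as open_http(method, url, values), which is how lxml.html.submit_form() behaves in the releases I know of but is not stated in this thread. recording_open_http and my_submit_form are hypothetical names, and no network request is made.

```python
from lxml import html


def recording_open_http(method, url, values):
    """Stand-in HTTP mechanism: return what would have been sent."""
    return (method, url, list(values))


def my_submit_form(form, extra_values=None):
    # Same interface as submit_form(), minus open_http, which is fixed here.
    return html.submit_form(form, extra_values,
                            open_http=recording_open_http)


doc = html.fromstring(
    '<html><body>'
    '<form action="http://example.org/search" method="post">'
    '<input type="text" name="q" value="lxml">'
    '</form></body></html>'
)
form = doc.forms[0]
result = my_submit_form(form, extra_values={'page': '2'})
method, url, values = result
```

Callers then use my_submit_form() everywhere and get the chosen HTTP mechanism without involving the parser.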

Participants (2): Ian Bicking, Stefan Behnel