[lxml-dev] lxml.html and forms
I feel a little bad adding a bunch of stuff to lxml.html when it's supposed to get all stable. But I was getting ready for a presentation on lxml.html, and this seemed like it made it a lot more fun. So with my last commit you can do things like:

    from lxml.html import parse, open_in_browser
    url = 'http://tripsweb.rtachicago.com/'
    page = parse(url)
    page.make_links_absolute(url)
    form = page.forms[0]
    form.inputs['Orig'].value = '1535 W Leland'
    form.inputs['Dest'].value = '847 W Bertrand'
    res = form.submit()
    res_page = parse(res)
    res_page.make_links_absolute(res_page.geturl())
    open_in_browser(res_page)

It's kind of like Mechanize, only of course better.

There are some things I still haven't figured out. Some data structures are convenient, but maybe have some non-obvious aspects. Like form.inputs, which doesn't always return elements (for things like checkboxes it can return something that is more like a logical element).

Also, I'd like to merge in most of the functionality of lxml.html.formfill (except for error-filling), so where form.form_values() currently returns a list of all the values as they'd be if submitted, I'd like to make it settable. And maybe even have form.form_values return something that would modify inputs in-place, so that form.form_values['Orig'] = '1535 W Leland' would mean the equivalent of form.inputs['Orig'].value = '1535 W Leland'.

Another open question is actual form submission. Right now it uses urllib. But I like httplib2, for instance, and I'd like it to be possible to use that.

Also, I'm wondering about how to keep track of the URL when a page is parsed. Stefan mentioned if you use parse(url) it would keep track of that... where? I'd like it to be possible to keep the URL around for any kind of parsing, e.g., with document_fromstring(html, url=X).

Ian
Hi Ian, Ian Bicking wrote:
I feel a little bad adding a bunch of stuff to lxml.html when it's supposed to get all stable.
And you should! You're lucky that there will be more than one 2.0alpha version. :) No, seriously. We do this for fun, right? And adding cool stuff from time to time is a pretty good way to keep up the motivation.
So with my last commit you can do things like:
    from lxml.html import parse, open_in_browser
    url = 'http://tripsweb.rtachicago.com/'
    page = parse(url)
    page.make_links_absolute(url)
    form = page.forms[0]
    form.inputs['Orig'].value = '1535 W Leland'
    form.inputs['Dest'].value = '847 W Bertrand'
    res = form.submit()
    res_page = parse(res)
    res_page.make_links_absolute(res_page.geturl())
    open_in_browser(res_page)
Sounds like you should put something like that into the docs. (hint, hint)
It's kind of like Mechanize, only of course better. There's some things I still haven't figured out. Some data structures are convenient, but maybe have some non-obvious aspects. Like form.inputs, which doesn't always return elements (for things like checkboxes it can return something that is more like a logical element).
Have I ever encouraged you to look at objectify? It has special data Elements that behave like normal Python data classes, but are actually objects. Something similar could apply here, you could use a string-like Element for "input" and a boolean-like Element for "checkbox". Hmmm, and radio buttons could be lists? Although a boolean-like Element always has the disadvantage that bool() would behave different for it than for an in-tree element (i.e.: does it have children?)
Also, I'd like to merge in most of the functionality of lxml.html.formfill (except for error-filling), so where form.form_values() currently returns a list of all the values as they'd be if submitted, I'd like to make it settable. And maybe even have form.form_values return something that would modify inputs in-place, like form.form_values['Orig'] = '1535 W Leland' mean the equivalent of form.inputs['Orig'].value = '1535 W Leland'.
Hmmm, I already stumbled over the name "form_values" when it actually behaves more like "form_items". This looks like it should be a dictionary-like class, but it's actually more like a hash bag, as parameters can repeat. Those don't seem to have an intuitive mapping to Python idioms, at least not when the most common use case with unique keys is supposed to be convenient.

Although, you could actually return a subclass of "list" in form_values that also supports __getitem__ and __setitem__ with string keys. Then, at least, it would be consistent for reading *and* writing. That sounds nicely polymorphic and is sufficiently close to a dict to be helpful in the most common case, but stays mainly a list for the general case. You could then call it "inputitems" to let it match with "inputs" and dicts.
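To make the list-with-string-keys idea concrete, here is a minimal sketch. The class name InputItems is hypothetical, and so is the choice that a string key addresses the first pair with that name; none of this is lxml API, and it is written for current Python 3 rather than the Python of this thread.

```python
class InputItems(list):
    """A list of (name, value) pairs that also accepts string keys.

    Hypothetical sketch: the name and the first-match rule are
    assumptions made for illustration, not lxml behavior.
    """

    def __getitem__(self, key):
        if isinstance(key, str):
            # Return the value of the first pair carrying this name.
            for name, value in self:
                if name == key:
                    return value
            raise KeyError(key)
        # Integers and slices keep plain list behavior.
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        if isinstance(key, str):
            # Replace the first pair with this name, keeping its position.
            for index, (name, _) in enumerate(self):
                if name == key:
                    super().__setitem__(index, (name, value))
                    return
            raise KeyError(key)
        super().__setitem__(key, value)


items = InputItems([("Orig", ""), ("Dest", ""), ("Dest", "alt")])
items["Orig"] = "1535 W Leland"
```

Repeated names stay visible through ordinary list access, so the general case is not lost while the common single-valued case reads like a dict.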
Another option question is actual form submission. Right now it uses urllib. But I like httplib2, for instance, and I'd like it to be possible to use that.
What about a module global setting? You would most likely not want to use both. Alternatively, you could provide a simple interface that takes a URL and a list of name-value pairs and opens it. Then implement it for both libraries and provide an optional keyword argument in submit() that takes a callable function with that signature (or maybe an instance of a dedicated abstract superclass, if you want to make the interface visible).
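The callable-based interface could look roughly like the sketch below. The function names submit, urllib_fetch and fake_fetch are hypothetical stand-ins, not lxml API; the only library calls used are urlencode and urlopen from the Python 3 standard library. The fake fetcher is there so the example runs without network access.

```python
import io
from urllib.parse import urlencode
from urllib.request import urlopen


def urllib_fetch(url, values):
    """Default opener: POST the encoded name-value pairs via urllib."""
    data = urlencode(values).encode("ascii")
    return urlopen(url, data)


def submit(action_url, values, fetch=urllib_fetch):
    """Hypothetical stand-in for form.submit(): delegate to `fetch`.

    `fetch` is any callable taking (url, list_of_pairs) and returning
    a file-like object.
    """
    return fetch(action_url, list(values))


def fake_fetch(url, values):
    # Test double: echoes the request back instead of opening a socket.
    body = "%s?%s" % (url, urlencode(values))
    return io.BytesIO(body.encode("ascii"))


pairs = [("Orig", "1535 W Leland"), ("Dest", "a"), ("Dest", "b")]
reply = submit("http://example.com/plan", pairs, fetch=fake_fetch)
result = reply.read().decode("ascii")
```

An httplib2-based fetcher would plug in the same way, as long as it accepts the URL and the list of pairs and returns something file-like.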
Also, I'm wondering about how to keep track of the URL when a page is parsed. Stefan mentioned if you use parse(url) it would keep track of that... where? I'd like it to be possible to keep the URL around for any kind of parsing, e.g., with document_fromstring(html, url=X).
You can pass a "base_url" keyword arg to HTML(). If you want to read the original URL, wrap a document in an ElementTree and read its "docinfo.URL" property. Stefan
Stefan Behnel wrote:
So with my last commit you can do things like:
    from lxml.html import parse, open_in_browser
    url = 'http://tripsweb.rtachicago.com/'
    page = parse(url)
    page.make_links_absolute(url)
    form = page.forms[0]
    form.inputs['Orig'].value = '1535 W Leland'
    form.inputs['Dest'].value = '847 W Bertrand'
    res = form.submit()
    res_page = parse(res)
    res_page.make_links_absolute(res_page.geturl())
    open_in_browser(res_page)
Sounds like you should put something like that into the docs. (hint, hint)
Yeah, I put in docstrings for everything, but it needs more docs to show how it fits together.
It's kind of like Mechanize, only of course better. There's some things I still haven't figured out. Some data structures are convenient, but maybe have some non-obvious aspects. Like form.inputs, which doesn't always return elements (for things like checkboxes it can return something that is more like a logical element).
Have I ever encouraged you to look at objectify? It has special data Elements that behave like normal Python data classes, but are actually objects. Something similar could apply here, you could use a string-like Element for "input" and a boolean-like Element for "checkbox". Hmmm, and radio buttons could be lists?
Although a boolean-like Element always has the disadvantage that bool() would behave different for it than for an in-tree element (i.e.: does it have children?)
Ugh... I don't like that idea at all. Elements aren't strings, or bools, or whatever. Well, elements do have truthiness, but that itself drives me nuts -- I refuse to think of an element with no children as "false". I usually test len(el) == 0 if I want to test for children, just out of a stubborn refusal to consider something like an input element false. Using .value is a bit crude, though easy enough to figure out -- but I'd rather use wrappers to give a more convenient access than override the elements more than they are already overridden. I'm still thinking about how microformat parsing should really work, but I suspect that will also be a wrapper around elements and not something in the elements itself. My intuition is that microformats don't exactly map to elements or to classes. Anyway, a somewhat similar issue.
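The explicit tests mentioned above can be illustrated as follows. This sketch uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed; the sample markup is made up, and lxml elements support len() and find() in the same way.

```python
import xml.etree.ElementTree as ET

form = ET.fromstring('<form><input name="Orig"/></form>')
field = form.find("input")

# Ask the two questions separately instead of relying on truthiness.
has_children = len(field) > 0            # an <input> has no children
exists = field is not None               # but the element was found
missing = form.find("select") is None    # find() returns None if absent
```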
Also, I'd like to merge in most of the functionality of lxml.html.formfill (except for error-filling), so where form.form_values() currently returns a list of all the values as they'd be if submitted, I'd like to make it settable. And maybe even have form.form_values return something that would modify inputs in-place, like form.form_values['Orig'] = '1535 W Leland' mean the equivalent of form.inputs['Orig'].value = '1535 W Leland'.
Hmmm, I already stumbled over the name "form_values" when it actually behaves more like "form_items". This looks like it should be a dictionary-like class, but it's actually more like a hash bag, as parameters can repeat. Those don't seem to have an intuitive mapping to Python idioms, at least not when the most common use case with unique keys is supposed to be convenient.
Although, you could actually return a subclass of "list" in form_values that also supports __getitem__ and __setitem__ with string keys. Then, at least, it would be consistent for reading *and* writing. That sounds nicely polymorphic and is sufficiently close to a dict to be helpful in the most common case, but stays mainly a list for the general case. You could then call it "inputitems" to let it match with "inputs" and dicts.
In Paste I use a multi-key dict implementation to hold form keys, so that you get something dict-like that doesn't lose information like ordering. It's basically a view over a list of tuples. The implementation is here:

http://svn.pythonpaste.org/Paste/trunk/paste/util/multidict.py

Unfortunately there's no clear convention for how these kinds of dict-like objects work. I chose to make them as much like normal dicts as possible (so, for instance, if you do d[key] = value, then it's always true that d[key] == value), since most of the time keys are single-valued. But for an actual form I'd like to present the entire form if possible. Like I am now, I guess, with set-like objects and whatnot.

And really what form_values gives is intended for urllib.urlencode, and maybe can just be left that way. The order doesn't matter as much to the Python side, as it's just intrinsic in the way the page is laid out. That is, you can't (usually) "make item 4 be (name, value)", because item 4 already has a name, and the value might be constrained anyway. You could say, possibly, "make the second text input with name X have value Y", but that's relatively uncommon in forms and still more constrained than a general dictionary interface. I.e., you can't invent new names, you can't change the order of the fields, and constrained fields like checkboxes stay constrained.

So maybe keep form_values, and use something else entirely that is more dict-like for this more dynamic get/set structure. Something a bit like form.inputs, but maybe fully embrace the wrapperness of it.

That thing would be more strictly dict-like, and every key would map to some structure that represents the entirety of what represents that key in the form. So a single text input would map to a string. A single checkbox to a boolean (kind of... it's a little fuzzy; it kind of maps to None/the-value-of-the-checkbox, but I could allow a true/false setter as well). Multi-select to a set, etc.

Radio buttons would map to a single value, but I'd also want to give some access to the possible set of values (since unlike a text box there is a constrained set of possible values). Right now you get that with form.inputs['radio_name'].value_options, but that won't work with a flatter dictionary. Maybe there'd generally be a form_values.options('field_name'), which would be None for unconstrained, and a set for constrained fields.
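A flat, dict-like view of that kind might be sketched like this. Everything here is hypothetical: the FormValues class, the plain-dict field model and the field names are assumptions for illustration, not what lxml.html does. The sketch covers text, radio and multi-select fields; the checkbox case is left out.

```python
class FormValues:
    """Hypothetical flat, dict-like view over the fields of one form."""

    def __init__(self, fields):
        # fields maps a field name to a dict with "kind", "value" and,
        # for constrained fields, an "options" set.
        self._fields = fields

    def __getitem__(self, name):
        return self._fields[name]["value"]

    def __setitem__(self, name, value):
        field = self._fields[name]  # KeyError: new names cannot be invented
        allowed = field.get("options")
        if field["kind"] == "multiselect":
            # Assigning any iterable replaces the whole selection.
            value = set(value)
            if allowed is not None and not value <= allowed:
                raise ValueError("unknown option for %r" % name)
        elif allowed is not None and value not in allowed:
            raise ValueError("unknown option for %r" % name)
        field["value"] = value

    def options(self, name):
        # None for unconstrained fields, a set for constrained ones.
        return self._fields[name].get("options")


values = FormValues({
    "Orig": {"kind": "text", "value": ""},
    "mode": {"kind": "radio", "value": "bus", "options": {"bus", "rail"}},
    "days": {"kind": "multiselect", "value": set(), "options": {"sat", "sun"}},
})
values["Orig"] = "1535 W Leland"
values["mode"] = "rail"
values["days"] = ["sat"]
```

Constrained fields stay constrained here: assigning a value outside the option set raises ValueError, and assigning to an unknown name raises KeyError.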
Another option question is actual form submission. Right now it uses urllib. But I like httplib2, for instance, and I'd like it to be possible to use that.
What about a module global setting? You would most likely not want to use both.
Alternatively, you could provide a simple interface that takes a URL and a list of name-value pairs and opens it. Then implement it for both libraries and provide an optional keyword argument in submit() that takes a callable function with that signature (or maybe an instance of a dedicated abstract superclass, if you want to make the interface visible).
That's what I was thinking of. I don't like module global settings at all. Passing it in to submit seems fine. I was thinking about using a class variable too, if you wanted to subclass the elements, or just set it manually on a particular instance. Maybe it would be attached to the tree object? E.g.:

    foo = parse(blah)
    foo.getroottree().urlfetch = my_url_fetch

I was also thinking about whether I should return a new parsed page, or just a file-like, or what. Or a file-like object that has a method to get the page, perhaps; e.g., new_page = form.submit().document(). I don't think the url fetching function would need to do any of this, it would just have a very minimal interface and the submit method would wrap it up in whatever seems most convenient.
Also, I'm wondering about how to keep track of the URL when a page is parsed. Stefan mentioned if you use parse(url) it would keep track of that... where? I'd like it to be possible to keep the URL around for any kind of parsing, e.g., with document_fromstring(html, url=X).
You can pass a "base_url" keyword arg to HTML(). If you want to read the original URL, wrap a document in an ElementTree and read its "docinfo.URL" property.
OK, I guess that keyword argument should be available in all the parsing functions. Maybe I should add a property to elements too, that fetches that information from the tree. And possibly something in parse that uses fp.geturl() if it is available. -- Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers
Ian Bicking wrote:
Stefan Behnel wrote: And really what form_values gives is intended for urllib.urlencode, and maybe can just be left that way. The order doesn't matter as much to the Python side, as it's just intrinsic in the way the page is laid out. That is, you can't (usually) "make item 4 be (name, value)", because item 4 already has a name, and the value might be constrained anyway. You could say, possibly, "make the second text input with name X have value Y", but that's relatively uncommon in forms and still more constrained than a general dictionary interface. I.e., you can't invent new names, you can't change the order of the fields, and constrained fields like checkboxes stay constrained. So maybe keep form_values, and use something else entirely that is more dict-like for this more dynamic get/set structure. Something a bit like form.inputs, but maybe fully embrace the wrapperness of it.
Makes sense to me.
That thing would be more strictly dict-like, and every key would map to some structure that represents the entirety of what represents that key in the form. So a single text input would map to a string.
Sure.
A single checkbox to a boolean (kind of... it's a little fuzzy; it kind of maps to None/the-value-of-the-checkbox, but I could allow a true/false setter as well).
Hmm, except for an empty string value, Python's idea of a truth value would match that. And as you said, changing the form structure is not really intended, so you'd normally not change the value string but rather the "checked" property. So, assigning a truth value would simply change that, whereas a string value could still change the value property. The return value would then be the string value or None.

For the special case of an empty string, you could return a string subclass that evaluates to the bool value True. Not sure if I like this, though, sounds like too much magic - and you never know where values end up in application code... Maybe it's a rare enough corner case to accept this, though. Or isn't there a Unicode character like "zero width space" or something like that, that we could return instead?
Multi-select to a set, etc. Radio buttons would map to a single value, but I'd also want to give some access to the possible set of values (since unlike a text box there is a constrained set of possible values).
Ok, so, how would you set them?
form.inputs["my_radio_name"] = "new_value"
Like this? This would then deselect all other radio buttons with the name "my_radio_name" and only select the one with the "new_value" value. If we adopt this, reading the property should definitely return the selected value as a single string:
    >>> form.inputs["my_radio_name"]
    'new_value'

Maybe we could return a subclass with an "element" property that returns the Element that carries that value?

    >>> form.inputs["my_radio_name"].element
    <Element 'radio' at ...>
Right now you get that with form.inputs['radio_name'].value_options, but that won't work with a flatter dictionary.
Why not? I actually like that.
Maybe there'd generally be a form_values.options('field_name'), which would be None for unconstrained, and a set for constrained fields.
Sounds too generic for a simple case. You shouldn't forget that you can't really fill a form without knowing what is a radio button and what is a checkbox, so there is not much to gain by providing a generic API.

    hasattr(el, "value_options")

is also easy to write and reads better than

    el.value_options is None
Another option question is actual form submission. Right now it uses urllib. But I like httplib2, for instance, and I'd like it to be possible to use that.
Alternatively, you could provide a simple interface that takes a URL and a list of name-value pairs and opens it.
That's what I was thinking of. I don't like module global settings at all. Passing it in to submit seems fine. I was thinking about using a class variable too, if you wanted to subclass the elements, or just set it manually on a particular instance. Maybe it would be attached to the tree object? E.g.:
    foo = parse(blah)
    foo.getroottree().urlfetch = my_url_fetch
That wouldn't work, as ElementTrees (and Elements) are not kept alive by the tree, so you can't store state in them.
I was also thinking about whether I should return a new parsed page, or just a file-like, or what. Or a file-like object that has a method to get the page, perhaps; e.g., new_page = form.submit().document(). I don't think the url fetching function would need to do any of this, it would just have a very minimal interface and the submit method would wrap it up in whatever seems most convenient.
You can't return a parsed tree as the server reply can be anything from XML to weird binary. I think a file-like serves most purposes. Maybe an additional "parse()" method would work here, but I don't think it's necessary.
reply_tree = parse(form.submit())
works just fine, is intuitive and avoids overhead.
OK, I guess that keyword argument should be available in all the parsing functions.
"string" parsing functions. Sure.
Maybe I should add a property to elements too, that fetches that information from the tree. And possibly something in parse that uses fp.geturl() if it is available.
etree already does that internally:

    cdef _getFilenameForFile(source):
        """Given a Python File or Gzip object, give filename back.

        Returns None if not a file object.
        """
        # file instances have a name attribute
        if hasattr(source, 'name'):
            return source.name
        # gzip file instances have a filename attribute
        if hasattr(source, 'filename'):
            return source.filename
        # urllib2
        if hasattr(source, 'geturl'):
            return source.geturl()
        return None

Stefan
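For readers who want to try that lookup order outside of Pyrex, here is a plain Python rendition of the same three checks. The function name and the two small test doubles are hypothetical; only the order of the attribute checks is taken from the snippet above.

```python
def filename_for_file(source):
    """Return a name or URL for a file-like object, or None."""
    # Plain files expose .name, gzip files expose .filename.
    for attribute in ("name", "filename"):
        if hasattr(source, attribute):
            return getattr(source, attribute)
    # urllib responses expose geturl().
    if hasattr(source, "geturl"):
        return source.geturl()
    return None


class FakeFile:
    """Test double for an open file."""
    name = "page.html"


class FakeResponse:
    """Test double for a urllib response."""

    def geturl(self):
        return "http://example.com/result"


from_file = filename_for_file(FakeFile())
from_response = filename_for_file(FakeResponse())
from_other = filename_for_file(object())
```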
Stefan Behnel wrote:
A single checkbox to a boolean (kind of... it's a little fuzzy; it kind of maps to None/the-value-of-the-checkbox, but I could allow a true/false setter as well).
Hmm, except for an empty string value, Python's idea of a truth value would match that. And as you said, changing the form structure is not really intended, so you'd normally not change the value string but rather the "checked" property. So, assigning a truth value would simply change that, whereas a string value could still change the value property. The return value would then be the string value or None.
For the special case of an empty string, you could return a string subclass that evaluates to the bool value True. Not sure if I like this, though, sounds like too much magic - and you never know where values end up in application code... Maybe it's a rare enough corner case to accept this, though. Or isn't there a Unicode character like "zero width space" or something like that, that we could return instead?
The empty string is definitely a corner case, as many server-side languages would treat that as false already. Maybe it could just be returned as True in that case. This could break code that expects a string, but it's such a strange case anyway that I don't mind too much. Or I could return a string subclass of str that is true, which is also very weird, but again it's very much a corner case so maybe it's not that big a deal. If you don't give a value to a checkbox it defaults to "on" anyway, so only an explicit value="" causes this.
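The "string subclass that is true" option can be sketched in a few lines. CheckedValue and checkbox_value are hypothetical names, and reporting None for an unchecked box is an assumption that follows the None/the-value-of-the-checkbox mapping discussed above; this is not lxml code.

```python
class CheckedValue(str):
    """A str that is always true, even when empty (hypothetical)."""

    def __bool__(self):
        return True


def checkbox_value(checked, value="on"):
    """Return None for an unchecked box, else its always-true value.

    The "on" default mirrors what a checkbox without a value attribute
    submits.
    """
    if not checked:
        return None
    return CheckedValue(value)


empty_but_checked = checkbox_value(True, "")
unchecked = checkbox_value(False, "anything")
```

The result still compares equal to the plain string, so only code that tests truthiness sees the difference. That is also where the magic could surprise application code.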
Multi-select to a set, etc. Radio buttons would map to a single value, but I'd also want to give some access to the possible set of values (since unlike a text box there is a constrained set of possible values).
Ok, so, how would you set them?
form.inputs["my_radio_name"] = "new_value"
Like this? This would then deselect all other radio buttons with the name "my_radio_name" and only select the one with the "new_value" value. If we adopt this, reading the property should definitely return the selected value as a single string:
    >>> form.inputs["my_radio_name"]
    'new_value'
Yes, right now it works like:

    form.inputs['my_radio_name'].value = 'new_value'

Where form.inputs['my_radio_name'] is a subclass of list, which contains all the radio input elements and also allows this group setting. If it's a group of checkboxes, it's:

    form.inputs['my_checkbox_name'].value.add('value1')

Which checks the checkbox with the value 'value1'. You can also assign to value, which clears the set and assigns values from the iterator you give.

So basically I could take what I have now, and just always get/set .value to create a flatish dictionary. And if you assign directly to the dictionary, it would clear the current values and then update with the values you give, just like the set works.

Whether this should replace or augment .inputs, I'm not sure. I think augment, since .inputs gives you access to all the elements, which sometimes you will want.
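The radio-group behavior described here, a list of inputs with a group-level .value, can be modeled in isolation. This is a sketch under stated assumptions, not the lxml.html implementation: the Radio dataclass is a made-up stand-in for an input element, and raising ValueError on an unknown value is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass
class Radio:
    """Minimal stand-in for an <input type="radio"> element."""
    value: str
    checked: bool = False


class RadioGroup(list):
    """A list of same-named radio inputs with a group-level value."""

    @property
    def value_options(self):
        return [radio.value for radio in self]

    @property
    def value(self):
        # The value of the checked radio, or None if nothing is checked.
        for radio in self:
            if radio.checked:
                return radio.value
        return None

    @value.setter
    def value(self, new_value):
        if new_value not in self.value_options:
            raise ValueError("no radio with value %r" % new_value)
        # Check the matching radio and uncheck all the others.
        for radio in self:
            radio.checked = radio.value == new_value


group = RadioGroup([Radio("bus", checked=True), Radio("rail")])
group.value = "rail"
```

The individual elements remain reachable by index, which is the reason given above for keeping an element-level accessor next to any flat value view.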
Maybe we could return a subclass with an "element" property that returns the Element that carries that value?
    >>> form.inputs["my_radio_name"].element
    <Element 'radio' at ...>
Then we have something stringish that isn't quite a string. And when you do an assignment, you get back something that's different than what you assigned. It all feels too magic to me. I think we can just have two accessors, one that gives you elements (like the current form.inputs) and one that gives you values only.
Right now you get that with form.inputs['radio_name'].value_options, but that won't work with a flatter dictionary.
Why not? I actually like that.
You'd also have to augment the string-like object, since form.inputs['radio_name'] would be the value of the currently checked radio button.
Maybe there'd generally be a form_values.options('field_name'), which would be None for unconstrained, and a set for constrained fields.
Sounds too generic for a simple case. You shouldn't forget that you can't really fill a form without knowing what is a radio button and what is a checkbox, so there is not much to gain by providing a generic API.
hasattr(el, "value_options")
is also easy to write and reads better than
el.value_options is None
Yes, most of the time you'll be filling out forms that you expect to have very particular fields. But it's useful generally. With a flat dictionary it's hard to get access to per-field information, so there has to be some other means of access. Anyway, currently value_options is only set on those elements and objects where it makes sense.
Another option question is actual form submission. Right now it uses urllib. But I like httplib2, for instance, and I'd like it to be possible to use that. Alternatively, you could provide a simple interface that takes a URL and a list of name-value pairs and opens it. That's what I was thinking of. I don't like module global settings at all. Passing it in to submit seems fine. I was thinking about using a class variable too, if you wanted to subclass the elements, or just set it manually on a particular instance. Maybe it would be attached to the tree object? E.g.:
    foo = parse(blah)
    foo.getroottree().urlfetch = my_url_fetch
That wouldn't work, as ElementTrees (and Elements) are not kept alive by the tree, so you can't store state in them.
Hrm... that's too bad. I'd like to keep some kind of local information around, ideally inherited as you go from page to page. I really hate global settings.
I was also thinking about whether I should return a new parsed page, or just a file-like, or what. Or a file-like object that has a method to get the page, perhaps; e.g., new_page = form.submit().document(). I don't think the url fetching function would need to do any of this, it would just have a very minimal interface and the submit method would wrap it up in whatever seems most convenient.
You can't return a parsed tree as the server reply can be anything from XML to weird binary. I think a file-like serves most purposes. Maybe an additional "parse()" method would work here, but I don't think it's necessary.
reply_tree = parse(form.submit())
works just fine, is intuitive and avoids overhead.
Yeah, you are probably right. The etree parse method works just fine right now, especially if it already picks up the url.
Stefan Behnel wrote:
OK, I guess that keyword argument should be available in all the parsing functions.
"string" parsing functions. Sure.
I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url argument, and root.getroottree().docinfo.URL is read-only. Also, why are all the signatures "..." in help? E.g., help(lxml.etree.HTML) gives "HTML(...)". Is this a Pyrex thing? Perhaps fixable? If not, it would be nice to give signature help in the docstrings.
Hi, Ian Bicking wrote:
Stefan Behnel wrote:
OK, I guess that keyword argument should be available in all the parsing functions.
"string" parsing functions. Sure.
I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url argument
It works for me in both trunk and html branch:

    def HTML(text, _BaseParser parser=None, base_url=None):
Also, why are all the signatures "..." in help? E.g., help(lxml.etree.HTML) gives "HTML(...)". Is this a Pyrex thing? Perhaps fixable? If not, it would be nice to give signature help in the docstrings.
Yes, that's a Pyrex (or rather C) thing. The signature is not visible in C modules. I tried to make Pyrex add signatures to the docstrings automatically, but that turned out to be harder than I thought and I didn't have the time to get it right since then. I attached a patch that gets you part of the way. I started experimenting with epydoc and it can actually read a signature line that you prepend to docstrings, so doing that would give us nicely formatted HTML docs. http://codespeak.net/lxml/dev/api/ Stefan
Stefan Behnel wrote:
Hi,
Ian Bicking wrote:
Stefan Behnel wrote:
OK, I guess that keyword argument should be available in all the parsing functions. "string" parsing functions. Sure. I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url argument
It works for me in both trunk and html branch:
def HTML(text, _BaseParser parser=None, base_url=None):
Really? Here's what I get on the html branch:
    >>> from lxml.etree import HTML
    >>> h = HTML('<span>', None, 'http://foo.com')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: function takes at most 2 arguments (3 given)
    >>> h = HTML('<span>', base_url='http://foo.com')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: 'base_url' is an invalid keyword argument for this function
Ian Bicking wrote:
Stefan Behnel wrote:
Ian Bicking wrote:
Stefan Behnel wrote:
OK, I guess that keyword argument should be available in all the parsing functions. "string" parsing functions. Sure. I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url argument
It works for me in both trunk and html branch:
def HTML(text, _BaseParser parser=None, base_url=None):
Really? Here's what I get on the html branch:
    >>> from lxml.etree import HTML
    >>> h = HTML('<span>', None, 'http://foo.com')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: function takes at most 2 arguments (3 given)
    >>> h = HTML('<span>', base_url='http://foo.com')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: 'base_url' is an invalid keyword argument for this function
Believe me:
    >>> import lxml.etree as et
    >>> et.HTML("<html>", None, "oh")
    <Element html at b7947b44>
    >>> et.HTML("<html>", base_url="oh")
    <Element html at b7947b94>

    $ LANG=en_GB svn info src/lxml/etree.pyx
    Path: src/lxml/etree.pyx
    Name: etree.pyx
    URL: https://scoder@codespeak.net/svn/lxml/branch/html/src/lxml/etree.pyx
    [...]
    Revision: 45164
    Node Kind: file
    Schedule: normal
    Last Changed Author: scoder
    Last Changed Rev: 44837
    Last Changed Date: 2007-07-08 09:54:56 +0200 (Sun, 08 Jul 2007)
    Text Last Updated: 2007-07-16 15:01:37 +0200 (Mon, 16 Jul 2007)
    Checksum: 9b183d6891d5e5f3606cd13350b582ad

Have you rebuilt lxml.etree lately? Or do you have an installed egg version that takes precedence?

Stefan
Stefan Behnel wrote:
Have you rebuilt lxml.etree lately? Or do you have an installed egg version that takes precedence?
Ah, you are right, I had not rebuilt lately. python setup.py develop got it back in order.
participants (2): Ian Bicking, Stefan Behnel