ANN: DOMForm 0.0.1a released

John J. Lee jjl@pobox.com
05 Oct 2003 20:53:34 +0100


http://wwwsearch.sourceforge.net/DOMForm/

This is the first release.  There are many known bugs, and interfaces
will change.  Most of the bugs are in the JavaScript support.  The
ClientForm work-alike stuff is relatively stable (but see the entities
and select_default bugs listed in the known bugs list on the web
page).  Feedback is welcome, but please check the list of known bugs
on the web page before reporting a problem.

Requires Python 2.3 and PyXML 0.8.3 (earlier versions may work, but
are untested).  Currently mxTidy is required (I may switch to uTidylib
at some point).  The spidermonkey Python module is required if you
want JavaScript interpretation.


DOMForm is a Python module for web scraping and web testing.  It knows
how to evaluate embedded JavaScript code in response to appropriate
events.  DOMForm supports both the ClientForm HTML form interface and
the HTML DOM level 2 interface (note that at the moment the DOM is
written to an out-of-date version of the specification, and has some
hacks to get it to work with 'DOM as deployed').  The ClientForm
interface makes it
easy to parse HTML forms, fill them in and return them to the server.
The DOM interface makes it easy to get at other parts of the document,
and makes JavaScript support possible.  The ability to switch back and
forth between the two interfaces allows simpler code than would result
from using either interface alone.  DOMForm is partly derived from
several third-party libraries.  The JavaScript support currently
depends on Mozilla's GPLed spidermonkey JavaScript interpreter (which
is available separately from Mozilla itself), and a Python interface
to spidermonkey.

Simple example:

from urllib2 import urlopen
from DOMForm import ParseResponse

response = urlopen("http://www.example.com/")
window = ParseResponse(response)
doc = window.document  # HTML DOM Level 2 HTMLDocument interface
forms = window._htmlforms  # list of objects supporting ClientForm.HTMLForm i/face
form = forms[0]

assert form.name == "some_form"
domform = form.node  # level 2 HTML DOM HTMLFormElement interface
control = form.find_control("some_control")  # ClientForm.Control i/face
domcontrol = control.node  # corresponding level 2 HTML DOM HTMLElement i/face
doc.some_form._htmlform  # back to the ClientForm.HTMLForm interface again
doc.some_form.some_control._control  # ClientForm.Control interface again

response = urlopen(form.click())  # domform.submit() also works
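
For readers who do not have DOMForm installed, the parse / fill in /
submit cycle described above can be sketched with only the Python 3
standard library.  This is not DOMForm or ClientForm code: the
FormCollector class, the PAGE content and the /search action are all
hypothetical, and the sketch only handles <input> controls and GET
submission.  It is meant to show roughly what the ClientForm-style
interface takes care of for you.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormCollector(HTMLParser):
    """Collect each <form> and the name/value pairs of its <input> controls.

    Hypothetical helper for illustration only; not part of DOMForm.
    """

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                "name": attrs.get("name"),
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "controls": {},
            })
        elif tag == "input" and self.forms and attrs.get("name"):
            # Simplification: an input is attributed to the most recently
            # opened form, without checking that the form is still open.
            self.forms[-1]["controls"][attrs["name"]] = attrs.get("value") or ""


# Hypothetical page content; a real script would fetch this from the server.
PAGE = """
<form name="some_form" action="/search" method="get">
  <input type="text" name="some_control" value="">
  <input type="hidden" name="lang" value="en">
</form>
"""

parser = FormCollector()
parser.feed(PAGE)
form = parser.forms[0]

# Fill in one control, then build the URL a GET submission would request.
form["controls"]["some_control"] = "python"
query = urlencode(form["controls"])
submit_url = urljoin("http://www.example.com/", form["action"]) + "?" + query
print(submit_url)
```

Everything here is done by hand: finding the form, tracking its
controls, encoding the values and resolving the action URL.  Select
lists, checkboxes, POST bodies and JavaScript event handlers are left
out entirely.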


John