Parsing/Crawler Questions - solution

lkcl luke.leighton at
Sun Mar 8 10:22:48 CET 2009

On Mar 7, 9:56 pm, "bruce" <bedoug... at> wrote:
> ....
> and this solution will somehow allow a user to create a web parsing/scraping
> app for parsing links and javascript from a web page?

 not just parsing the links and the "static" javascript, but:

 * actually executing the javascript, giving the "page" a chance to
actually _look_ like it would if it was being viewed in a "real" web
browser.

 so any XMLHTTPRequests will _actually_ get executed, and _actually_
result in the content of the web page being _properly_ loaded (a
rough console-crawler sketch follows, after the examples below).
 so, e.g. instead of seeing a "Loader" page on gmail you would
_actually_ see the user's email and the adverts (assuming you went to
the trouble of putting in the username/password) because the AJAX
would _actually_ get executed by the WebKit engine, and the DOM model
accessed thereafter.

 * giving the user the opportunity to call DOM methods such as
getElementsByTagName, and the opportunity to access properties such
as anchors.

  in webkit-glib "gdom" bindings, that would be:

 * anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");


 * g_object_get(doc, "anchors", &anchor_list, NULL);

  which in pywebkitgtk (thanks to python-pygobject auto-generation of
python bindings from gobject bindings) translates into:

 * doc.get_elements_by_tag_name("a")


 * doc.props.anchors

  which in pyjamas-desktop, a high-level abstraction on top of _that_,
turns into (a short sketch tying these together follows the examples):

 * from pyjamas import DOM
   anchor_list = DOM.getElementsByTagName(doc, "a")


 * from pyjamas import DOM
   anchor_list = DOM.getAttribute(doc, "anchors")
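
  a minimal sketch of that last (pyjamas-desktop) level, to make it
concrete.  assumptions: "doc" is the document of a page webkit has
already loaded and run the javascript for, iterating the node list
relies on the pyjamas.DOM iterator support mentioned in the quoted
message below, and "href" is just an illustrative attribute name:

   from pyjamas import DOM

   def list_links(doc):
       # every <a> element in the page, *after* the javascript has run
       anchors = DOM.getElementsByTagName(doc, "a")
       for a in anchors:
           # read an attribute off each element ("href" is purely
           # illustrative - any DOM property would do)
           print DOM.getAttribute(a, "href")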

answer: yes.
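
 to tie it together, a rough sketch of the console-only kind of
crawler described in the quoted message below, using pywebkitgtk.
treat get_dom_document(), the node-list iteration and the "href"
property as assumptions - they stand in for whatever the DOM-bindings
fork actually names them:

   import gtk
   import webkit    # pywebkitgtk

   def on_load_finished(view, frame):
       # by the time load-finished fires, webkit has parsed the page
       # AND executed its javascript, so AJAX-generated content is
       # already sitting in the DOM tree.
       doc = view.get_dom_document()   # assumption: name of the call
       for a in doc.get_elements_by_tag_name("a"):
           print a.props.href          # assumption: property name
       gtk.main_quit()

   view = webkit.WebView()             # never shown - no GUI involved
   view.connect("load-finished", on_load_finished)
   view.open("http://www.example.com/")
   gtk.main()                          # spin the glib main loop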


> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.... at
> [mailto:python-list-bounces+bedouglas=earthlink.... at]On Behalf
> Of lkcl
> Sent: Saturday, March 07, 2009 2:34 AM
> To: python-l... at
> Subject: Re: Parsing/Crawler Questions - solution
> On Mar 7, 12:19 am, rounderwe... at wrote:
> > So, it sounds like your update means that it is related to a specific
> > url.
> > I'm curious about this issue myself.  I've often wondered how one
> > could properly crawl an AJAX-ish site when you're not sure how quickly
> > the data will be returned after the page has been loaded.
>  you want to look at the webkit engine - no, not the graphical browser
> - the ParseTree example - and combine it with pywebkitgtk - no, not the
> "original" version, the one which has DOM-manipulation bindings
> through webkit-glib.
> the webkit parse tree example, despite being based on the GTK
> "port" as they like to call it in webkit (which just means that it
> links with GTK, not QT4 or wxWidgets), is a console-based application.
> in other words, despite it being GTK, it still does NOT output
> graphical crap to the screen, yet it still *executes* the javascript
> on the page.
> dummy functions for "mouse", "keyboard", "console errors" are given as
> examples and are left as an exercise for the application writer to
> fill-in-the-blanks.
> combining this parse tree example with pywebkitgtk (see
> would provide a means by which web pages can be
> executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
> gobject bindings, a python app will be able to walk the DOM tree as
> expected.
> i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
> for someone, on the pyjamas-dev mailing list.
> dd014540
> so, actually, you may be better off starting from pyjamas-desktop and
> then cutting out the "fire up the GTK window" bit, from
> is based on pywebkitgtk's
> the alternative to webkit is to use python-hulahop - it will do the
> same thing, but just using python bindings to gecko instead of python-
> bindings-to-glib-bindings-to-webkit.
> l.
> --
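
 p.s. re the "dummy functions for mouse, keyboard, console errors" bit
in the quoted message: at the python level you can get the equivalent
by connecting the webkit signals and swallowing them, so the headless
crawler stays quiet at the console.  a sketch (signal names as per the
webkit gtk port - double-check the exact signatures):

   import webkit    # pywebkitgtk

   def on_console_message(view, message, line, source_id):
       # returning True tells webkit the message has been handled,
       # keeping javascript console errors off our console
       return True

   def on_script_alert(view, frame, message):
       # likewise, swallow javascript alert() popups
       return True

   view = webkit.WebView()
   view.connect("console-message", on_console_message)
   view.connect("script-alert", on_script_alert)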
