Parsing/Crawler Questions - solution

lkcl luke.leighton at googlemail.com
Sun Mar 8 10:22:48 CET 2009


On Mar 7, 9:56 pm, "bruce" <bedoug... at earthlink.net> wrote:
> ....
>
> and this solution will somehow allow a user to create a web parsing/scraping
> app for parsing links, and javascript from a web page?


 not just parsing the links and the "static" javascript, but:

 * actually executing the javascript, giving the "page" a
chance to actually _look_ like it would if it was being viewed in a
"real" web browser.

 so any XMLHTTPRequests will _actually_ get executed, _actually_
result in _actually_ having the content of the web page _properly_
modified.

 so, e.g. instead of seeing a "Loader" page on gmail you would
_actually_ see the user's email and the adverts (assuming you went to
the trouble of putting in the username/password) because the AJAX
would _actually_ get executed by the WebKit engine, and the DOM model
accessed thereafter.
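
 to make the difference concrete, here is a minimal sketch of what a
purely "static" parser sees. it uses python's standard html.parser, not
webkit, and the page content, the AnchorCollector class and the
static_links function are all made up for illustration. the link that
the script would add never shows up, because nothing executes the
script:

```python
from html.parser import HTMLParser

# hypothetical page: one link is in the markup, the other would only
# exist after a real engine had executed the script
PAGE = """
<html><body>
<a href="/static-link">static</a>
<script>
var a = document.createElement("a");
a.href = "/dynamic-link";
document.body.appendChild(a);
</script>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def static_links(html):
    """Return the hrefs visible without running any javascript."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


print(static_links(PAGE))
```

 only "/static-link" is reported. getting at "/dynamic-link" needs an
engine that actually runs the script and then exposes the resulting DOM.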


 * giving the user the opportunity to call DOM methods such as
getElementsByTagName and the opportunity to access properties such as
document.anchors.

  in webkit-glib "gdom" bindings, that would be:

 * anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

or

 * g_object_get(doc, "anchors", &anchor_list, NULL);

  which in pywebkitgtk (thanks to python-pygobject auto-generation of
python bindings from gobject bindings) translates into:

 * doc.get_elements_by_tag_name("a")

or

 * doc.props.anchors

  which in pyjamas-desktop, a high-level abstraction on top of _that_,
turns into:

 * from pyjamas import DOM
   anchor_list = DOM.getElementsByTagName(doc, "a")

or

 * from pyjamas import DOM
   anchor_list = DOM.getAttribute(doc, "anchors")
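
 for anyone without webkit-glib to hand, the shape of the
getElementsByTagName call can be tried out with python's standard
xml.dom.minidom. this is only a stand-in under stated assumptions:
minidom does not execute javascript, it has no "anchors" property, and
the document below is made up for illustration.

```python
from xml.dom import minidom

# hypothetical, well-formed document; minidom is an XML parser, so
# real-world "tag soup" HTML would need a more forgiving parser
DOC = minidom.parseString(
    "<html><body>"
    '<a name="top" href="#top">top</a>'
    "<p>text</p>"
    '<a href="http://example.com/">example</a>'
    "</body></html>"
)

# same DOM method name as in the bindings above, just on a static tree
anchor_list = DOC.getElementsByTagName("a")
hrefs = [a.getAttribute("href") for a in anchor_list]
print(hrefs)
```

 the point of the webkit-based approach is that the same kind of call is
made against the DOM _after_ the page's javascript has run.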

answer: yes.

l.

> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.... at python.org
>
> [mailto:python-list-bounces+bedouglas=earthlink.... at python.org]On Behalf
> Of lkcl
> Sent: Saturday, March 07, 2009 2:34 AM
> To: python-l... at python.org
> Subject: Re: Parsing/Crawler Questions - solution
>
> On Mar 7, 12:19 am, rounderwe... at gmail.com wrote:
> > So, it sounds like your update means that it is related to a specific
> > url.
>
> > I'm curious about this issue myself.  I've often wondered how one
> > could properly crawl an AJAX-ish site when you're not sure how quickly
> > the data will be returned after the page has been loaded.
>
>  you want to look at the webkit engine - no not the graphical browser
> - the ParseTree example - and combine it with pywebkitgtk - no not the
> "original" version, the one which has DOM-manipulation bindings
> through webkit-glib.
>
> the webkit parse tree example, despite being based on the GTK
> "port" as they like to call it in webkit (which just means that it
> links with GTK not QT4 or wxWidgets), is a console-based application.
>
> in other words, despite it being GTK, it still does NOT output
> graphical crap to the screen, yet it still *executes* the javascript
> on the page.
>
> dummy functions for "mouse", "keyboard", "console errors" are given as
> examples and are left as an exercise for the application writer to
> fill-in-the-blanks.
>
> combining this parse tree example with pywebkitgtk (see
> demobrowser.py) would provide a means by which web pages can be
> executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
> gobject bindings, a python app will be able to walk the DOM tree as
> expected.
>
> i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
> for someone, on the pyjamas-dev mailing list.
>
> http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3c...
> dd014540
>
> so, actually, you may be better off starting from pyjamas-desktop and
> then cutting out the "fire up the GTK window" bit, from pyjd.py.
>
> pyjd.py is based on pywebkitgtk's demobrowser.py
>
> the alternative to webkit is to use python-hulahop - it will do the
> same thing, but just using python bindings to gecko instead of python-
> bindings-to-glib-bindings-to-webkit.
>
> l.
> --http://mail.python.org/mailman/listinfo/python-list



