Parsing/Crawler Questions - solution
luke.leighton at googlemail.com
Sun Mar 8 10:22:48 CET 2009
On Mar 7, 9:56 pm, "bruce" <bedoug... at earthlink.net> wrote:
> and this solution will somehow allow a user to create a web parsing/scraping
chance to actually _look_ like it would if it was being viewed as a
"real" web browser.
so any XMLHTTPRequests will _actually_ get executed, _actually_
result in _actually_ having the content of the web page _properly_
so, e.g. instead of seeing a "Loader" page on gmail you would
_actually_ see the user's email and the adverts (assuming you went to
the trouble of putting in the username/password) because the AJAX
would _actually_ get executed by the WebKit engine, and the DOM model
* giving the user the opportunity to call DOM methods such as
getElementsByTagName and the opportunity to access properties such as
in webkit-glib "gdom" bindings, that would be:
* anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");
* g_object_get(doc, "anchors", &anchor_list, NULL);
which in pywebkitgtk (thanks to python-pygobject auto-generation of
python bindings from gobject bindings) translates into:
which in pyjamas-desktop, a high-level abstraction on top of _that_,
becomes:
* from pyjamas import DOM
  anchor_list = DOM.getElementsByTagName(doc, "a")
* from pyjamas import DOM
  anchor_list = DOM.getAttribute(doc, "anchors")
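
for anyone who wants to try the DOM calls themselves without building
webkit-glib first, here is a minimal sketch using only the python standard
library. it is NOT the pyjamas or pywebkitgtk API: xml.dom.minidom stands in
for the document object, the PAGE string is a made-up example, and nothing
here executes javascript, so it only shows what the two lookups return on a
static, well-formed page.

```python
from xml.dom import minidom

# hypothetical page content, just for illustration
PAGE = """<html><body>
<a name="top" href="/index.html">home</a>
<p>some text <a href="/mail">mail</a></p>
</body></html>"""

# minidom is a stand-in for the document a browser engine would hand you
doc = minidom.parseString(PAGE)

# same DOM method name as in the examples above
anchor_list = doc.getElementsByTagName("a")
hrefs = [a.getAttribute("href") for a in anchor_list]

# minidom has no "anchors" property, so approximate it by hand:
# keep only the <a> elements that carry a name attribute
named = [a for a in anchor_list if a.hasAttribute("name")]
names = [a.getAttribute("name") for a in named]

print(hrefs)
print(names)
```

one thing to watch: in the W3C HTML DOM the "anchors" property only lists
<a> elements that have a name attribute, so it is not a drop-in replacement
for getElementsByTagName("a"), which returns every <a> in the document.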
> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.... at python.org
> [mailto:python-list-bounces+bedouglas=earthlink.... at python.org]On Behalf
> Sent: Saturday, March 07, 2009 2:34 AM
> To: python-l... at python.org
> Subject: Re: Parsing/Crawler Questions - solution
> On Mar 7, 12:19 am, rounderwe... at gmail.com wrote:
> > So, it sounds like your update means that it is related to a specific
> > url.
> > I'm curious about this issue myself. I've often wondered how one
> > could properly crawl an AJAX-ish site when you're not sure how quickly
> > the data will be returned after the page has been.
> you want to look at the webkit engine - no not the graphical browser
> - the ParseTree example - and combine it with pywebkitgtk - no not the
> "original" version, the one which has DOM-manipulation bindings
> through webkit-glib.
> the webkit parse tree example, despite being based on the GTK
> "port" as they like to call it in webkit (which just means that it
> links with GTK not Qt4 or wxWidgets), is a console-based application.
> in other words, despite it being GTK, it still does NOT output
> on the page.
> dummy functions for "mouse", "keyboard", "console errors" are given as
> examples and are left as an exercise for the application writer to
> combining this parse tree example with pywebkitgtk (see
> demobrowser.py) would provide a means by which web pages can be
> executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
> gobject bindings, a python app will be able to walk the DOM tree as
> i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
> for someone, on the pyjamas-dev mailing list.
> so, actually, you may be better off starting from pyjamas-desktop and
> then cutting out the "fire up the GTK window" bit, from pyjd.py.
> pyjd.py is based on pywebkitgtk's demobrowser.py
> the alternative to webkit is to use python-hulahop - it will do the
> same thing, but just using python bindings to gecko instead of python-