Parsing/Crawler Questions - solution

bruce bedouglas at
Sat Mar 7 22:56:09 CET 2009


and this solution will somehow allow a user to create a web parsing/scraping
app for parsing links and javascript from a web page?

-----Original Message-----
From: at
[ at] On Behalf Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: python-list at
Subject: Re: Parsing/Crawler Questions - solution

On Mar 7, 12:19 am, rounderwe... at wrote:
> So, it sounds like your update means that it is related to a specific
> url.
> I'm curious about this issue myself.  I've often wondered how one
> could properly crawl an AJAX-ish site when you're not sure how quickly
> the data will be returned after the page has been.

 you want to look at the webkit engine - no not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no not the
"original" version, the one which has DOM-manipulation bindings
through webkit-glib.

the webkit parse tree example, despite being based on the GTK
"port" as they like to call it in webkit (which just means that it
links with GTK, not Qt4 or wxWidgets), is a console-based application.

in other words, despite it being GTK, it still does NOT output
graphical crap to the screen, yet it still *executes* the javascript
on the page.
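
for contrast, here is a minimal sketch of what a purely static parse sees.
it uses only the python standard library (html.parser), not webkit, and it
does NOT execute any javascript. the class name LinkAndScriptCollector, the
collect() helper and the SAMPLE page are made up for illustration. links
that a script would add at run time stay buried in the script text, which
is the gap a javascript-executing engine is meant to close.

```python
from html.parser import HTMLParser


class LinkAndScriptCollector(HTMLParser):
    """Collect href targets, external script URLs and inline script bodies.

    Hypothetical helper for illustration; it parses markup only and never
    runs the scripts it finds.
    """

    def __init__(self):
        super().__init__()
        self.links = []           # href values of <a> tags in the markup
        self.script_srcs = []     # src values of <script src=...> tags
        self.inline_scripts = []  # text of inline <script> bodies
        self._script_buf = None   # chunks of the script currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "script":
            if attrs.get("src"):
                self.script_srcs.append(attrs["src"])
            self._script_buf = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._script_buf is not None:
            self._script_buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._script_buf is not None:
            body = "".join(self._script_buf).strip()
            if body:
                self.inline_scripts.append(body)
            self._script_buf = None


def collect(html_text):
    """Parse html_text and return the populated collector."""
    parser = LinkAndScriptCollector()
    parser.feed(html_text)
    parser.close()
    return parser


# Made-up page: one static link, one external script, one inline script
# that would add a second link if it were actually executed.
SAMPLE = """
<html><body>
<a href="/static-link">static</a>
<script src="/app.js"></script>
<script>document.write('<a href="/added-by-js">dynamic</a>');</script>
</body></html>
"""

if __name__ == "__main__":
    result = collect(SAMPLE)
    print(result.links)
    print(result.script_srcs)
    print(result.inline_scripts)
```

the static parse reports only the link that is literally in the markup.
the "/added-by-js" link shows up solely as text inside the inline script.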

dummy functions for "mouse", "keyboard", "console errors" are given as
examples and are left as an exercise for the application writer to

combining this parse tree example with pywebkitgtk (see would provide a means by which web pages can be
executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.
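
to give an idea of what "walking the DOM tree" with an iterator looks like,
here is a small sketch. it is an assumption-laden stand-in: it uses the
standard library's xml.dom.minidom on a well-formed sample string, NOT the
pywebkitgtk or pyjamas.DOM bindings, whose exact calls are not shown in this
thread. iter_elements() and find_links() are hypothetical names.

```python
from xml.dom import minidom


def iter_elements(node):
    """Yield every element node under `node`, depth-first, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            yield child
            # Recurse into the element's own children.
            yield from iter_elements(child)


def find_links(document):
    """Return the href of every <a> element that has one."""
    return [
        el.getAttribute("href")
        for el in iter_elements(document)
        if el.tagName == "a" and el.hasAttribute("href")
    ]


# Made-up, well-formed sample; minidom is an XML parser and would reject
# the sloppy HTML found on many real pages.
SAMPLE = (
    "<html><body>"
    "<div><a href='/one'>one</a></div>"
    "<p>text <a href='/two'>two</a> <a name='anchor'>no href</a></p>"
    "</body></html>"
)

if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    print([el.tagName for el in iter_elements(doc)])
    print(find_links(doc))
```

the same generator shape should carry over to any binding that exposes child
nodes and node types, but that is a guess about those bindings, not a
description of them.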

so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit, from is based on pywebkitgtk's

the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of python-

