Parsing/Crawler Questions - solution
bedouglas at earthlink.net
Sat Mar 7 22:56:09 CET 2009
and this solution will somehow allow a user to create a web parsing/scraping
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Sent: Saturday, March 07, 2009 2:34 AM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions - solution
On Mar 7, 12:19 am, rounderwe... at gmail.com wrote:
> So, it sounds like your update means that it is related to a specific
> I'm curious about this issue myself. I've often wondered how one
> could properly crawl an AJAX-ish site when you're not sure how quickly
> the data will be returned after the page has been.
you want to look at the webkit engine - no, not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no, not the
"original" version, the one which has DOM-manipulation bindings
the webkit parse tree example, despite it being based on the GTK
"port" as they like to call it in webkit (which just means that it
links with GTK not QT4 or wxWidgets), is a console-based application.
in other words, despite it being GTK, it still does NOT output
on the page.
dummy functions for "mouse", "keyboard", "console errors" are given as
examples and are left as an exercise for the application writer to
combining this parse tree example with pywebkitgtk (see
demobrowser.py) would provide a means by which web pages can be
executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as
i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.
so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit, from pyjd.py.
pyjd.py is based on pywebkitgtk's demobrowser.py
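
to give a rough idea of what "walking the DOM tree" from python looks like, here is a minimal sketch. it does NOT use webkit, pywebkitgtk or pyjamas-desktop: python's standard xml.dom.minidom stands in for the real DOM bindings, purely as an assumption for illustration, and the names walk, collect_links and SAMPLE are made up for this example. with a real engine the tree would be the one left behind after the page's javascript has run, not a static string.

```python
from xml.dom import minidom


def walk(node, depth=0):
    """Yield (depth, node) for node and every descendant, depth-first."""
    yield depth, node
    for child in node.childNodes:
        yield from walk(child, depth + 1)


def collect_links(document):
    """Return the href of every <a> element found while walking the tree."""
    links = []
    for _depth, node in walk(document):
        # only element nodes have a tagName and attributes
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "a":
            if node.hasAttribute("href"):
                links.append(node.getAttribute("href"))
    return links


# hypothetical, well-formed stand-in for a page's DOM (assumption: a real
# scraper would get this tree from the browser engine instead)
SAMPLE = (
    "<html><body><div id='results'>"
    "<a href='/item/1'>one</a><a href='/item/2'>two</a>"
    "</div></body></html>"
)

doc = minidom.parseString(SAMPLE)
print(collect_links(doc))
```

the point of the generator is that the same walk can be reused for any extraction job (links, table cells, text), with only the filtering in the caller changing.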
the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of python-