Parsing/Crawler Questions - solution

bruce bedouglas at earthlink.net
Sat Mar 7 16:56:09 EST 2009


....

and this solution will somehow allow a user to create a web parsing/scraping
app for parsing links and javascript from a web page?


-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions - solution


On Mar 7, 12:19 am, rounderwe... at gmail.com wrote:
> So, it sounds like your update means that it is related to a specific
> url.
>
> I'm curious about this issue myself.  I've often wondered how one
> could properly crawl an AJAX-ish site when you're not sure how quickly
> the data will be returned after the page has been.

 you want to look at the webkit engine - no not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no not the
"original" version, the one which has DOM-manipulation bindings
through webkit-glib.

the webkit parse tree example, despite being based on the GTK
"port" as they like to call it in webkit (which just means that it
links with GTK, not QT4 or wxWidgets), is a console-based application.

in other words, despite it being GTK, it still does NOT output
graphical crap to the screen, yet it still *executes* the javascript
on the page.

dummy functions for "mouse", "keyboard", "console errors" are given as
examples and are left as an exercise for the application writer to
fill-in-the-blanks.

combining this parse tree example with pywebkitgtk (see
demobrowser.py) would provide a means by which web pages can be
executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as
expected.
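
as a rough sketch of what "walking the DOM tree" for links and
javascript looks like: the example below uses the standard library's
xml.dom.minidom as a stand-in, NOT the webkit-glib bindings, so nothing
here executes javascript. PAGE, walk() and collect() are made-up names
for illustration, and the assumption is that the tree handed back by the
engine has a similar W3C-style shape (childNodes, tagName, getAttribute).

```python
from xml.dom import minidom

# hypothetical page content, standing in for the DOM that a browser
# engine would hand back *after* running the page's javascript
PAGE = """<html>
<head><script src="app.js"></script></head>
<body>
<a href="/one">one</a>
<div><a href="http://example.com/two">two</a></div>
<script>var x = 1;</script>
</body>
</html>"""


def walk(node):
    """yield every element at or below node, depth-first."""
    if node.nodeType == node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        for found in walk(child):
            yield found


def collect(doc):
    """return (links, scripts) found by walking the whole tree."""
    links, scripts = [], []
    for el in walk(doc.documentElement):
        name = el.tagName.lower()
        if name == "a" and el.hasAttribute("href"):
            links.append(el.getAttribute("href"))
        elif name == "script":
            if el.hasAttribute("src"):
                # external script: record where it is loaded from
                scripts.append(("src", el.getAttribute("src")))
            else:
                # inline script: record the source text itself
                text = "".join(c.data for c in el.childNodes
                               if c.nodeType == c.TEXT_NODE)
                scripts.append(("inline", text))
    return links, scripts


doc = minidom.parseString(PAGE)
links, scripts = collect(doc)
print(links)
print(scripts)
```

minidom only accepts well-formed XML, so it is no use on real-world
pages; the point is just the recursive walk, which should carry over to
whatever DOM object the engine gives you once the page has been run.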

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.


http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3ced662a2dd014540

so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit, from pyjd.py.

pyjd.py is based on pywebkitgtk's demobrowser.py

the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of
python-bindings-to-glib-bindings-to-webkit.


l.
--
http://mail.python.org/mailman/listinfo/python-list



