parsing javascript

Mon Oct 13 04:46:42 EDT 2008

On Oct 12, 2:28 pm, Philip Semanchuk <phi... at semanchuk.com> wrote:
> On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:
>
> > I have to do a parsing on webpages and fetch urls. My problem is, many
> > urls i need to parse are dynamically loaded using javascript function
> > (onload()). How to fetch those links from python? Thanks in advance.
>
> Selvam,
> You can try to find them yourself using string parsing, but that's
> difficult. The closer you want to get to "perfect" at finding URLs
> expressed in JS, the closer you'll get to rewriting a JS interpreter.
> For instance, this is not so hard to understand:
>     "http://example.com/"
> but this is:
>     "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/, the_domain_variable)
>
> This is a long-standing problem for any program that parses Web pages.

 yep :)
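
 to make philip's point concrete, here is a minimal sketch of the
"string parsing" approach.  the regex and the function name are my own
illustration, not taken from any library: it finds URLs that are
written out literally, and for the second example it happily returns
the placeholder string, because nothing has evaluated the .replace().

```python
import re

# Naive pattern: only matches absolute http(s) URLs written out literally.
URL_RE = re.compile(r"""https?://[^\s"'<>)]+""")


def find_literal_urls(js_source):
    """Return every literal http(s) URL found in a chunk of JS source."""
    return URL_RE.findall(js_source)


simple = 'window.location = "http://example.com/";'
computed = ('var u = "http://ZZZ_DOMAIN_ZZZ/index.html"'
            '.replace(/ZZZ_DOMAIN_ZZZ/, the_domain_variable);')

print(find_literal_urls(simple))    # -> ['http://example.com/']
# The placeholder comes back unchanged: the real domain only exists
# once the JavaScript has actually been executed.
print(find_literal_urls(computed))  # -> ['http://ZZZ_DOMAIN_ZZZ/index.html']
```

 getting the real URL out of the second case means evaluating the
javascript, which is where the options below come in.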

> You either have to embed a JS interpreter in your application or

 yep.

 there are several.

 pyv8 is the newest addition: http://advogato.org/article/985.html

 it's a python wrapper around google's v8 javascript execution
library.

 then there's pykhtml: http://paul.giannaros.org/pykhtml/

 it's a python wrapper around KHTML, providing very convenient access
to KDE's HTML capabilities: pykhtml "pretends" that the GUI part of
KDE doesn't exist, so you can run your program as a command-line
tool.  it will execute the javascript, which you will have to wait a
bit for of course; then you can walk the DOM tree (via the pykhtml
bindings) using pykhtml.DOM.getElementById() and
getElementsByTagName("a") etc. etc. looking for the URLs.
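
 i don't want to guess at the exact pykhtml calls here, so the sketch
below swaps in python's standard html.parser instead.  it assumes you
have already got the post-javascript DOM serialised back to an HTML
string by one of the engines above; LinkCollector and collect_links
are hypothetical names of my own.  the idea is the same as
getElementsByTagName("a"): visit every anchor and keep its href.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Return all anchor URLs found in an (already rendered) HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/a">one</a> <a href="http://example.com/b">two</a></p>'
    print(collect_links(page, "http://example.org/"))
```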

 there's even an AJAX example included which does 1-second polling of
the DOM model, waiting for a spell-checking web site to deliver the
answer.

then there's webkit, with the new glib bindings:
https://bugs.webkit.org/show_bug.cgi?id=16401

which are then followed up by python bindings to _those_ bindings:
http://code.google.com/p/pywebkitgtk/issues/detail?id=13

this will also allow you to execute arbitrary javascript - again, it's
similar to KHTML and in fact webkit really _is_ the KDE KHTML code
(JavaScriptCore, KJS etc) but forked, improved, etc. etc.

unfortunately, the glib bindings are tied - at three key and strategic
locations - to gtk at the moment.  it would take _very_ little work to
"un"tie them [pay me and i'll do the work], but until that's done you
would need to create a blank gtk window - just like is done with
pykhtml, behind the scenes.

it would be a very simple task to create a "dummy" - console-based -
port of webkit, providing an array of callbacks which you must hand to
the library.  at the moment, the design of webkit is not particularly
good in this respect: there are three ports, gtk, wx and qt, which are
heavily tied in to webkit.  it would be a _far_ better design to be
passing in a struct containing function callbacks (rather a lot of
them - about eighty!) and then what you could do is have a "console"-
based port of webkit, which would do the job you needed.

alternatively, if you don't mind wrapping a binary application with
e.g. popen2.Popen3 (or the subprocess module) then look at the webkit
DumpRenderTree application, paying particular attention to using the
--html option.  you won't have any
control over how long the javascript is executed for.  after an
arbitrary and small period of time, DumpRenderTree _stops_ executing
the javascript and prints out the HTML DOM model (in a non-html-layout
fashion - it's used for debugging and testing purposes but will
suffice for your purposes).
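
a rough sketch of the wrapping, using subprocess rather than Popen3
(the popen2 module is gone from python 3).  assumptions, loudly: the
dump_dom name is mine, the location of the DumpRenderTree binary
depends on your webkit build, and putting --html before the url is a
guess at the argument order, so check it against your own build.

```python
import subprocess


def dump_dom(binary, url, timeout=30.0):
    """Run a DumpRenderTree-style binary on `url` and return its stdout.

    `binary` is the command as a list, e.g. ["/path/to/DumpRenderTree"]
    (hypothetical path).  The "--html url" ordering is an assumption.
    """
    argv = list(binary) + ["--html", url]
    result = subprocess.run(
        argv,
        capture_output=True,  # collect stdout/stderr instead of inheriting
        text=True,            # decode the output to str
        timeout=timeout,      # give up if the process hangs
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical binary location; adjust for your own build.
    print(dump_dom(["./DumpRenderTree"], "http://example.com/"))
```

the returned text is whatever the tool prints, so you still have to
pick the URLs out of it yourself.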

so, as it stands, pywebkitgtk is _no worse_ than pykhtml, but with a
little bit of tweaking, the "gtk" could be removed from "pywebkitgtk"
and you'd end up with... ohh... call it "pywebkitglib" ... which would
be much better as a stand-alone library, for your purposes.

then there's also "spidermonkey", which is mozilla's javascript
engine.  i haven't investigated this option: haven't had a need to.

then there's also PyXPCOMExt, which is embedding python into mozilla,
and from there you have PyDOM, which allows you access to the DOM
model of the mozilla "thing".  so, if you don't mind embedding your
application into XULRunner, you've got a home for executing your app
and obtaining the urls, post-javascript-execution.

the neat thing about PyXPCOMExt is that you have complete and full
access to python - so your app can make external TCP and UDP sockets,
you can embed an entire _server_ in the damn thing if you want (you
could embed... python-twisted if you wanted!)  you can access the
filesystem - anything.  absolutely anything.  reason: the _entire_
python suite is embedded into the browser.  every single bit of it.

that's about all i've been able to find, so far.  there might be more
options out there.  not that there aren't enough already :)

all of them will allow you complete and full access to execution of
javascript, including AJAX execution.  AJAX responses arrive some time
after the page itself has loaded, which is why you'll need to do that
"polling" trick in many instances.
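
the "polling" trick, sketched independently of any particular engine.
poll_until, fetch and is_ready are hypothetical names: fetch would be
whatever call reads the current state of the DOM from pykhtml,
pywebkitgtk etc., and is_ready decides whether the javascript has
delivered what you are waiting for.

```python
import time


def poll_until(fetch, is_ready, interval=1.0, timeout=30.0):
    """Call fetch() every `interval` seconds until is_ready(value) is true.

    Returns the first value that satisfies is_ready, or raises
    TimeoutError once `timeout` seconds have passed without one.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a DOM that only gains its links on the third look.
    states = iter([[], [], ["http://example.com/"]])
    links = poll_until(lambda: next(states), bool, interval=0.1, timeout=5.0)
    print(links)
```

the timeout matters: a page whose javascript never finishes would
otherwise leave your crawler waiting forever.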

l.