urllib equivalent for HTTP requests

Diez B. Roggisch deets at nospam.web.de
Wed Oct 8 03:34:40 EDT 2008


K schrieb:
> Hello everyone,
> 
> I understand that urllib and urllib2 serve as really simple page
> request libraries. I was wondering if there is a library out there
> that can get the HTTP requests for a given page.
> 
> Example:
> URL: http://www.google.com/test.html
> 
> Something like: urllib.urlopen('http://www.google.com/test.html').files()
> 
> Lists HTTP Requests attached to that URL:
> => http://www.google.com/test.html
> => http://www.google.com/css/google.css
> => http://www.google.com/js/js.css


There are no "requests attached" to a URL. There is an HTML document 
behind it that may contain further external references.

> The other fun part is the inclusion of JS within <script> tags, i.e.
> the new Google Analytics script
> => http://www.google-analytics.com/ga.js
> 
> or css, @imports
> => http://www.google.com/css/import.css
> 
> I would like to keep track of that but I realize that py does not have
> a JS engine. :( Anyone with ideas on how to track these items or am I
> out of luck.

You can use e.g. BeautifulSoup to extract all links from the site.
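A minimal sketch of that idea. BeautifulSoup offers a more convenient API, but to keep this runnable without extra packages the sketch below uses only the stdlib html.parser; the sample HTML and the tag/attribute list are made up for illustration:

```python
# Sketch: collect the external references (scripts, stylesheets, links)
# an HTML document points at. Uses only the standard library; with
# BeautifulSoup the same thing is a soup.find_all() loop.
from html.parser import HTMLParser
from urllib.parse import urljoin

class RefCollector(HTMLParser):
    """Collect URLs referenced by common tags (a, link, script, img)."""
    WATCHED = {"a": "href", "link": "href", "script": "src", "img": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attr = self.WATCHED.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    # Resolve relative references against the page URL.
                    self.refs.append(urljoin(self.base_url, value))

html = """<html><head>
<link rel="stylesheet" href="/css/google.css">
<script src="http://www.google-analytics.com/ga.js"></script>
</head><body><a href="/test.html">test</a></body></html>"""

parser = RefCollector("http://www.google.com/")
parser.feed(html)
print(parser.refs)
```

CSS @imports aren't visible to an HTML parser; you'd have to fetch each stylesheet and scan its text (e.g. with a regex) separately.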

What you can't do, though, is capture the requests issued by 
JavaScript while it is *running*.

Diez
