Fundamental problem with urllib...

Jeff Pitman bruthasj at yahoo.com
Thu Apr 25 20:03:07 EDT 2002


Steve Holden wrote:

> "Jeremy Hylton" <jeremy at alum.mit.edu> wrote ...
>> "A.M. Kuchling" <akuchlin at ute.mems-exchange.org> wrote ...
>> > In article <yNUw8.74422$T%5.18813 at atlpnn01.usenetserver.com>,
>> > Steve Holden wrote:
>> > > Since urllib knows nothing of cookies, you will need to integrate
>> > > some sort of a cookie jar into the library, with a new API for the
>> > > clients to retrieve and store the cookies.

Or do it transparently.


>> > This is worthwhile, but I don't think it belongs in urllib.  It
>> > belongs in a module or package of its own that provides general
>> > Web-browser features such as cookies, remembering authentication
>> > usernames and passwords, and a cache.  This package could then be used
>> > for implementing HTML-scraping scripts, spiders, or a Web browser.

No, it belongs in urllib2, because urllib2 recursively re-opens URLs to
follow redirects, for example from http://site/index.php to
http://site/login_page.php to https://site/login_page.php.  All I can say
is "good luck!" trying to intervene from outside the package without
rewriting AbstractHTTPHandler.
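
To illustrate (hypothetical URLs, just a sketch of the problem): urlopen() 
follows the whole redirect chain internally, so the caller only ever sees 
the final response, and any Set-Cookie headers from the intermediate hops 
are already gone by then:

    import urllib2

    # urlopen() silently follows every redirect before returning
    response = urllib2.urlopen("http://site/index.php")
    print response.geturl()   # e.g. https://site/login_page.php
    # only the final hop's headers are visible here
    print response.info().getheader("Set-Cookie")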

>> I'm not sure what the difference between an HTTP client, like urllib
>> or urllib2, and a Web-browser is.  Other than urllib's monolithic
>> design, why wouldn't you want these sorts of features in the module?

Exactly!

> Personally I imagined passing a dictionary as an optional cookie jar
> argument, keyed by (domain, path) tuples. The library code would update
> this as dictated by its interactions with web sources.

This is what I did:

http://sourceforge.net/tracker/index.php?func=detail&aid=548197&group_id=5470&atid=305470

And, right now, it is transparent to anyone using urllib2.  It uses a 
persistent Dict (within any script that imports urllib2) that is keyed on 
the hostname and stores a "Cookie" object.  This Cookie object is updated 
("scraped") every time the server sends cookies to the client, and it is 
then consulted so that its headers are sent on each subsequent request to 
the server.
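
In rough outline the technique looks something like this (a simplified 
sketch, not the actual patch; the name CookieHTTPHandler is made up, and 
real cookies need path/expiry handling that is omitted here):

    import urllib2, urlparse

    # module-level jar: hostname -> "name=value" string
    cookie_jar = {}

    class CookieHTTPHandler(urllib2.HTTPHandler):
        def http_open(self, req):
            host = urlparse.urlparse(req.get_full_url())[1]
            cookie = cookie_jar.get(host)
            if cookie:
                # send back whatever this host has given us so far
                req.add_header("Cookie", cookie)
            response = urllib2.HTTPHandler.http_open(self, req)
            set_cookie = response.info().getheader("Set-Cookie")
            if set_cookie:
                # naive scrape: keep only name=value, drop path/expires
                cookie_jar[host] = set_cookie.split(";")[0]
            return response

    opener = urllib2.build_opener(CookieHTTPHandler())
    # opener.open() now remembers cookies per host for plain http requests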

Obviously this is v0.0.0.1, but I think it is the start of something usable.  
I'm trying to create a library that "HTML-scrapes" websites, and while doing 
so I hit a brick wall with Cookies.  This library is going to be similar to 
the related Perl scripts, except much cleaner.

Sample screen-scrape:

    ua = HTMLAgent( "http://www.yahoo.com/" )
    ua.report()

    form = ua.getFormByIndex( 0 )
    form.fill( 'q', 'python' )
    ua.submit( form )

    ua.report()
    ua.clickByName( 'Python Language Website' )

    print "Now at", ua.location.geturl()

So far so good with the library I've written, except that it can be slow 
because it uses minidom to parse the HTML.  And I'm a newbie at this stuff, 
so I don't know how to limit the parsing to only the <form></form> tags in 
the screen-scrape process.
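
One possibility (just a sketch, not something the library does yet) would 
be to skip the DOM entirely and use the standard HTMLParser module, which 
streams through the document and lets you record only the <form> and 
<input> tags you care about:

    import HTMLParser

    class FormParser(HTMLParser.HTMLParser):
        """Collect only <form> blocks and the <input> fields inside them."""
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self.forms = []        # list of (form_attrs, [input_attrs, ...])
            self.current = None

        def handle_starttag(self, tag, attrs):
            if tag == "form":
                self.current = (dict(attrs), [])
                self.forms.append(self.current)
            elif tag == "input" and self.current is not None:
                self.current[1].append(dict(attrs))

        def handle_endtag(self, tag):
            if tag == "form":
                self.current = None

    html_text = ('<html><body><form action="/search" method="get">'
                 '<input type="text" name="q"></form></body></html>')
    parser = FormParser()
    parser.feed(html_text)
    parser.close()
    print parser.forms

HTMLParser is strict about malformed markup, so sgmllib.SGMLParser may be a 
better fit for real-world pages, but the idea is the same.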

I'll clean it up a little tonight and drop it somewhere if you want to look 
at it.


-jeff


