Fundamental problem with urllib...
Jeff Pitman
bruthasj at yahoo.com
Thu Apr 25 20:03:07 EDT 2002
Steve Holden wrote:
> "Jeremy Hylton" <jeremy at alum.mit.edu> wrote ...
>> "A.M. Kuchling" <akuchlin at ute.mems-exchange.org> wrote ...
>> > In article <yNUw8.74422$T%5.18813 at atlpnn01.usenetserver.com>,
>> > Steve Holden wrote:
>> > > Since urllib knows nothing of cookies, you will need to integrate
>> > > some sort of a cookie jar into the library, with a new API for the
>> > > clients to retrieve and store the cookies.
Or do it transparently.
>> > This is worthwhile, but I don't think it belongs in urllib. It
>> > belongs in a module or package of its own that provides general
>> > Web-browser features such as cookies, remembering authentication
>> > usernames and passwords, and a cache. This package could then be used
>> > for implementing HTML-scraping scripts, spiders, or a Web browser.
No, it belongs in urllib2, because urllib2 opens pages recursively to follow
redirects, for example from http://site/index.php to
http://site/login_page.php to https://site/login_page.php. All I can say
is "good luck!" trying to intervene from outside the package without
rewriting AbstractHTTPHandler.
>> I'm not sure what the difference between an HTTP client, like urllib
>> or urllib2, and a Web-browser is. Other than urllib's monolithic
>> design, why wouldn't you want these sorts of features in the module?
Exactly!
> Personally I imagined passing a dictionary as an optional cookie jar
> argument, keyed by (domain, path) tuples. The library code would update
> this as dictated by its interactions with web sources.
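A minimal sketch of the jar Steve describes, keyed on (domain, path) tuples. The function names are mine, and it glosses over expiry, the Secure attribute, and domain matching; the stdlib cookie parser (the Cookie module in old Pythons, http.cookies today) does the header parsing:

```python
from http.cookies import SimpleCookie  # the "Cookie" module in Python 2

cookie_jar = {}  # (domain, path) -> {name: value}

def store(domain, set_cookie_header):
    """Fold one Set-Cookie header from `domain` into the jar."""
    parsed = SimpleCookie(set_cookie_header)
    for name, morsel in parsed.items():
        path = morsel["path"] or "/"
        cookie_jar.setdefault((domain, path), {})[name] = morsel.value

def matching(domain, path):
    """Collect the cookies that apply to a request for (domain, path)."""
    found = {}
    for (d, p), cookies in cookie_jar.items():
        if d == domain and path.startswith(p):
            found.update(cookies)
    return found
```

The library would call store() on every response and matching() before every request, so the caller only sees the dictionary it passed in.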
This is what I did:
http://sourceforge.net/tracker/index.php?func=detail&aid=548197&group_id=5470&atid=305470
And, right now, it is transparent to anyone using urllib2. It uses a
persistent dict (within a script that imports urllib2) keyed on the
hostname, storing a "Cookie" object per host. The Cookie object is
updated every time the server sends cookies to the client, and it is then
consulted so its headers go out with each request back to the server.
Obviously this is v0.0.0.1, but I think it is the start of something usable.
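In outline (this is a sketch of the scheme, not the actual patch; the function names are mine), the hostname-keyed version comes down to two hooks:

```python
from http.cookies import SimpleCookie  # the "Cookie" module in Python 2

jar = {}  # hostname -> SimpleCookie

def scrape(host, set_cookie_headers):
    """Response hook: fold the server's Set-Cookie headers into the
    host's Cookie object."""
    cookie = jar.setdefault(host, SimpleCookie())
    for header in set_cookie_headers:
        cookie.load(header)

def cookie_header(host):
    """Request hook: build the Cookie: header value to replay to this
    host, or None if we have nothing for it."""
    cookie = jar.get(host)
    if cookie is None:
        return None
    return "; ".join(
        morsel.OutputString(attrs=[]) for morsel in cookie.values())
```

Wired into the response and request paths of AbstractHTTPHandler, this is what makes it transparent: callers never touch the jar.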
I'm trying to create a library that "HTML-scrapes" websites, and while doing
so I hit a brick wall with cookies. This library is going to be similar to
the related Perl scripts, except much cleaner.
Sample screen-scrape:
ua = HTMLAgent( "http://www.yahoo.com/" )
ua.report()
form = ua.getFormByIndex( 0 )
form.fill( 'q', 'python' )
ua.submit( form )
ua.report()
ua.clickByName( 'Python Language Website' )
print "Now at", ua.location.geturl()
So far, so good with the library I've written. Except that it can be slow,
as it uses minidom to parse the HTML. And I'm a newbie at this stuff, so
I don't know how to limit the parse to only the <form></form> tags in the
screen-scrape process.
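For what it's worth, one direction I might try (just a sketch, untested against real pages): skip the DOM entirely and use the stdlib stream parser (the HTMLParser module in Python 2, html.parser today), keeping state only between <form> and </form>:

```python
from html.parser import HTMLParser  # "HTMLParser" module in Python 2

class FormOnlyParser(HTMLParser):
    """Collect forms and their input fields; ignore everything else."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []       # one dict per <form>
        self.in_form = False  # are we between <form> and </form>?

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
            self.forms.append({"attrs": dict(attrs), "inputs": []})
        elif self.in_form and tag == "input":
            self.forms[-1]["inputs"].append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
```

Since the parser never builds a tree, the rest of the page costs almost nothing.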
I'll clean it up a little tonight and drop it somewhere if you want to look
at it.
-jeff