How do I enter/receive webpage information?
John J. Lee
jjl at pobox.com
Sat Feb 5 17:58:52 EST 2005
Jorgen Grahn <jgrahn-nntq at algonet.se> writes:
[...]
> I did it this way successfully once ... it's probably the wrong approach in
> some ways, but It Works For Me.
>
> - used httplib.HTTPConnection for the HTTP parts, building my own requests
> with headers and all, calling h.send() and h.getresponse() etc.
>
> - created my own cookie container class (because there was a session
> involved, and logging in and such things, and all of it used cookies)
>
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from an HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
>
> Wrapped in classes this ended up as (fictive):
>
> client = Client('somehost:80')
> client.login('me', 'secret')
> a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
> foo = theFoo(client, 'yesterday')
>
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and client.
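A minimal sketch of the SGMLParser subclassing Jorgen describes (the
tag handled here and the page fed to it are just examples):

import sgmllib

class LinkParser(sgmllib.SGMLParser):
    # One subclass per kind of page you expect; this one just
    # collects the href of every <a> tag it sees.
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.links = []
    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

parser = LinkParser()
parser.feed('<a href="http://example.com/">example</a>')
parser.close()
print parser.links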
I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug. It's easy to see and modify
urllib2's headers if you need to get low level.
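For example, adding your own headers through urllib2 (the User-Agent
string here is just an illustration):

import urllib2

req = urllib2.Request("http://example.com/")
req.add_header("User-Agent", "Mozilla/5.0 (compatible)")
r = urllib2.urlopen(req)
print r.info()  # the response headers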
One starting point for web scraping with Python:
http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html
There are some modules you may find useful there, too.
Search Google Groups for urlencode. Or use my module ClientForm, if
you prefer. Experiment a little with an HTML form in a local file and
(e.g.) the 'ethereal' sniffer to see what happens when you click
submit.
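Roughly, submitting a form by hand comes down to urlencoding the
field names and values and POSTing the result (the URL and field
names below are invented):

import urllib, urllib2

data = urllib.urlencode({"user": "me", "password": "secret"})
# passing a data argument makes urllib2 issue a POST
r = urllib2.urlopen("http://example.com/login", data)
print r.read()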
The stdlib now has cookie support (in Python 2.4):
import cookielib, urllib2

# cookies the server sets are stored here...
cj = cookielib.CookieJar()
# ...and this opener sends them back on later requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
print r.read()
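If you want cookies handled on every urllib2.urlopen() call rather
than only through that one opener, install it globally:

urllib2.install_opener(opener)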
Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c. do often turn out to be
necessary to scrape stuff. A few useful tips:
http://wwwsearch.sourceforge.net/ClientCookie/doc.html#debugging
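If a sniffer is more than you need, urllib2 can also show you the
traffic itself: passing debuglevel=1 to the handler dumps the HTTP
requests and responses to stdout.

import urllib2

opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
r = opener.open("http://example.com/")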
John