How do I enter/receive webpage information?
Jorgen Grahn
jgrahn-nntq at algonet.se
Sat Feb 5 12:41:25 EST 2005
On 4 Feb 2005 15:33:50 -0800, Mudcat <mnations at gmail.com> wrote:
> Hi,
>
> I'm wondering the best way to do the following.
>
> I would like to use a map webpage (like yahoo maps) to find the
> distance between two places that are pulled in from a text file. I want
> to accomplish this without displaying the browser.
That's called "web scraping", in case you want to Google for info.
> I am looking at several options right now, including urllib, httplib,
> packet trace, etc. But I don't know where to start with it or if there
> are existing tools that I could incorporate.
>
> Can someone explain how to do this or point me in the right direction?
I did it this way successfully once ... it's probably the wrong approach in
some ways, but It Works For Me.
- used httplib.HTTPConnection for the HTTP parts, building my own requests
with headers and all, calling h.send() and h.getresponse() etc.
- created my own cookie container class (because there was a session
involved, and logging in and such things, and all of it used cookies)
- subclassed sgmllib.SGMLParser once for each kind of page I expected to
receive. Each class knew how to pull the information from an HTML document,
provided it looked as I expected it to. Very tedious work. In some cases it
can be easier and safer to just use the re module.
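To make the parsing step concrete, here is a minimal sketch, not the code I actually used. It assumes a current Python, where sgmllib no longer exists, so it swaps in html.parser.HTMLParser, which works in much the same callback style. The markup (a td cell with class "distance"), the class name DistanceParser and both helper functions are made up for illustration; a real map page will look different.

```python
import re
from html.parser import HTMLParser


class DistanceParser(HTMLParser):
    """Collects the text of every <td class="distance"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.distances = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "distance") in attrs:
            self.in_cell = True
            self.distances.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self.in_cell:
            self.distances[-1] += data


def parse_distances(html):
    """Parser-based extraction."""
    parser = DistanceParser()
    parser.feed(html)
    parser.close()
    return [d.strip() for d in parser.distances]


def regex_distances(html):
    """The 'just use re' alternative; only safe when the markup is very regular."""
    found = re.findall(r'<td class="distance">(.*?)</td>', html, re.S)
    return [m.strip() for m in found]


SAMPLE = '<table><tr><td>Bath</td><td class="distance"> 12.5 miles </td></tr></table>'
print(parse_distances(SAMPLE))
print(regex_distances(SAMPLE))
```

The parser version copes with extra attributes or whitespace changes only if you teach it to; the regex version breaks as soon as the tag is written any other way. Either one has to be rechecked whenever the site changes its pages.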
Wrapped in classes this ended up as (fictive):
client = Client('somehost:80')
client.login('me', 'secret')
a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
foo = theFoo(client, 'yesterday')
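A rough sketch of what could sit behind such a Client, under some assumptions: it uses http.client (the current name for httplib) and http.cookies from the standard library, the CookieBox class is a hypothetical stand-in for my own cookie container, and login() is left out because it depends entirely on the site's login form. The container here only remembers name=value pairs and ignores path, domain and expiry, which a real one would probably need.

```python
import http.client
from http.cookies import SimpleCookie


class CookieBox:
    """Tiny cookie container: remembers name=value pairs, ignores attributes."""

    def __init__(self):
        self.cookies = {}

    def store(self, set_cookie_headers):
        # Each item is the value of one Set-Cookie response header
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def header(self):
        # Value for the Cookie request header
        return "; ".join(f"{k}={v}" for k, v in self.cookies.items())


class Client:
    """Keeps one connection and one cookie container per host (illustrative)."""

    def __init__(self, hostport):
        # No network traffic happens until the first request
        self.conn = http.client.HTTPConnection(hostport)
        self.box = CookieBox()

    def get(self, path):
        headers = {}
        if self.box.cookies:
            headers["Cookie"] = self.box.header()
        self.conn.request("GET", path, headers=headers)
        resp = self.conn.getresponse()
        # Remember whatever cookies the server handed back
        self.box.store(resp.headers.get_all("Set-Cookie") or [])
        return resp.read()
```

Functions like theAsAndBs() would then call client.get() with the right path and feed the result to the matching page parser.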
I had to look deeply into the HTTP RFCs to do this, and also snoop the
traffic for a "real" session to see what went on between server and client.
/Jorgen
--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!