How do I enter/receive webpage information?

Jorgen Grahn jgrahn-nntq at algonet.se
Sat Feb 5 12:41:25 EST 2005


On 4 Feb 2005 15:33:50 -0800, Mudcat <mnations at gmail.com> wrote:
> Hi,
> 
> I'm wondering the best way to do the following.
> 
> I would like to use a map webpage (like yahoo maps) to find the
> distance between two places that are pulled in from a text file. I want
> to accomplish this without displaying the browser.

That's called "web scraping", in case you want to Google for info.

> I am looking at several options right now, including urllib, httplib,
> packet trace, etc. But I don't know where to start with it or if there
> are existing tools that I could incorporate.
> 
> Can someone explain how to do this or point me in the right direction?

I did it this way successfully once ... it's probably the wrong approach in 
some ways, but It Works For Me.

- used httplib.HTTPConnection for the HTTP parts, building my own requests
  with headers and all, calling h.send() and h.getresponse() etc.

- created my own cookie container class (because there was a session
  involved, and logging in and such things, and all of it used cookies)

- subclassed sgmllib.SGMLParser once for each kind of page I expected to
  receive. This class knew how to pull the information from an HTML document,
  provided it looked as I expected it to.  Very tedious work. It can be easier
  and safer to just use the re module in some cases.
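
To make the parsing step concrete, here is a rough sketch of both routes
(a parser subclass and a plain regular expression). It assumes Python 3,
where sgmllib no longer exists, so html.parser.HTMLParser stands in for
sgmllib.SGMLParser. The page layout (a <td class="distance"> cell), the
DistanceParser class and the SAMPLE string are all made up for illustration;
a real map page will look different.

```python
import re
from html.parser import HTMLParser


class DistanceParser(HTMLParser):
    """Collect the text of every <td class="distance"> cell (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.distances = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "td" and ("class", "distance") in attrs:
            self._in_cell = True
            self.distances.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self._in_cell:
            self.distances[-1] += data


# Invented sample page; only the structure matters here
SAMPLE = ('<table><tr><td class="name">Route</td>'
          '<td class="distance">12.5 miles</td></tr></table>')

parser = DistanceParser()
parser.feed(SAMPLE)
parser.close()

# The regex route: shorter, but tied to the exact markup
match = re.search(r'class="distance">\s*([\d.]+)\s*miles', SAMPLE)
regex_distance = float(match.group(1)) if match else None
```

The parser survives small changes in whitespace and attribute order; the
regex is less code but breaks as soon as the markup shifts.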

Wrapped in classes this ended up as (fictive):

client = Client('somehost:80')
client.login('me', 'secret')
a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
foo = theFoo(client, 'yesterday')
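
For what it's worth, the HTTP and cookie parts behind such a Client could
look roughly like the sketch below. This is not the code I actually wrote;
it assumes Python 3 (httplib is called http.client there), and the CookieJar
class, the get() method and the User-Agent string are invented for the
example. login() and the page-specific helpers are left out because they
depend entirely on the site.

```python
import http.client
from http.cookies import SimpleCookie


class CookieJar:
    """Tiny cookie container: remembers name=value pairs, ignores path and expiry."""

    def __init__(self):
        self._cookies = {}

    def update(self, set_cookie_headers):
        # Each element is the value of one Set-Cookie response header
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self._cookies[name] = morsel.value

    def header(self):
        # Value for the Cookie request header
        return "; ".join(f"{k}={v}" for k, v in self._cookies.items())


class Client:
    def __init__(self, hostport):
        # Constructing the object does not open a connection yet
        self.conn = http.client.HTTPConnection(hostport)
        self.jar = CookieJar()

    def get(self, path):
        headers = {"User-Agent": "example-scraper/0.1"}  # hypothetical value
        cookie = self.jar.header()
        if cookie:
            headers["Cookie"] = cookie
        self.conn.request("GET", path, headers=headers)
        response = self.conn.getresponse()
        # Remember any cookies the server handed out for later requests
        self.jar.update(response.headers.get_all("Set-Cookie") or [])
        return response.status, response.read().decode("utf-8", "replace")
```

A real cookie container also has to honour domain, path and expiry, which
is why the standard http.cookiejar module is usually the safer choice.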

I had to look deeply into the HTTP RFCs to do this, and also snoop the
traffic for a "real" session to see what went on between server and client.

/Jorgen

-- 
  // Jorgen Grahn <jgrahn@       Ph'nglui mglw'nafh Cthulhu
\X/                algonet.se>   R'lyeh wgah'nagl fhtagn!


