[PyAR2] Accessing Web Data

Thu Dec 20 23:06:16 CET 2007

urllib.urlopen() works, but by default it isn't nice to server admins: it
identifies itself with a generic Python-urllib User-Agent and makes no attempt
at caching or compression.  Dive Into Python has a brief example that uses
urlopen() to fetch pages, as well as a chapter on processing HTML.

http://diveintopython.org/html_processing/index.html

Dive Into Python seems to be aimed, generally speaking, at people who already
have some programming background, so you may have to wrap your mind around it
for a few minutes before you get it.
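
For what it's worth, being nicer mostly means identifying your script.  Here
is a minimal sketch, not anything taken from Dive Into Python: it uses urllib2
(urllib.request on Python 3) because that lets you set headers on a Request.
The USER_AGENT string, build_request() and fetch() are names I made up for
illustration.

```python
try:
    from urllib2 import Request, urlopen          # Python 2
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

# Hypothetical identification string; substitute your own script name.
USER_AGENT = 'pyar2-example/0.1'


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies the script via a User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Fetch url and return the raw response body."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch() on a page URL should then give you the body as a string (bytes
on Python 3), ready to hand to re or HTMLParser.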

On Dec 20, 2007 1:42 PM, Bob Fahr <bob.fahr at gmail.com> wrote:

> Wayne,
> One approach is to use the urllib library to fetch a page and then use
> regular expressions to find the particular item you want on the page.  Here
> is an example of getting a stock quote from Google; the stock symbol is
> passed in as the argument:
>
> import re
> import urllib
>
> def get_quote(symbol):
>     """Return the price shown for symbol, or a message if none is found."""
>     base_url = 'http://finance.google.com/finance?q='
>     content = urllib.urlopen(base_url + symbol).read()
>     # Capture the text of the first tag that carries class="pr".
>     m = re.search(r'class="pr".*?>(.*?)<', content)
>     if m:
>         quote = m.group(1)
>     else:
>         quote = 'no quote available for: ' + symbol
>     return quote
>
> The regular expressions can get fairly complex depending on what
> information you are trying to find.
>
> Another approach is to use the httplib.HTTPConnection and HTMLParser
> libraries.  You use HTTPConnection to create a connection to the website,
> and then use the connection methods to fetch the data, the HTTP status and
> other info.  Once you have the data you can use the parser to parse the HTML
> tags.  You'll have to override methods like handle_starttag, and the
> overridden methods then get called during parsing.  I use handle_starttag to
> evaluate each start tag in the HTML and find all of the anchor tags
> (links).  Here's a really short example:
>
> import httplib
> from HTMLParser import HTMLParser
>
> class MyHTMLParser(HTMLParser):
>   def __init__(self):
>     HTMLParser.__init__(self)
>     self.links = []
>   def handle_starttag(self, tag, attributes):
>     # attributes is a list of (name, value) pairs; href is not always first
>     if tag == 'a':
>       link = dict(attributes).get('href')
>       if link and link.endswith('.html'):
>         self.links.append(link)
>
> # in main
> parser = MyHTMLParser()
> connection = httplib.HTTPConnection('www.google.com')
> connection.request('GET', '/some_search_url')
> response = connection.getresponse()
> data_length = response.getheader('content-length')  # None if not sent
> data = response.read()
> connection.close()
> parser.feed(data)
>
> for link in parser.links:
>   print link  # replace with whatever you want to do with each link
>
> Hope this gets you started.
>
> On Dec 20, 2007 1:06 PM, W W < srilyk at gmail.com> wrote:
>
> > Hi,
> >
> > I'm a bit of a beginner at Python, and I'm trying to figure out how to
> > use Python to retrieve web pages.  So far I've been unsuccessful in my
> > attempts to find any information online.
> >
> > Basically, what I want to do is write a program to search
> > Google/Yahoo/etc. and return the site content so I can then search it.
> >
> > I'm sure (at least on Linux) I could send a system command to wget the
> > file, but that would severely limit cross-platform use, and worse, I'm sure
> > it's not very secure.
> >
> > Any pointers on how to do it or where to find the information would be
> > appreciated, thanks!
> > -Wayne
> >
> > --
> > To be considered stupid and to be told so is more painful than being
> > called gluttonous, mendacious, violent, lascivious, lazy, cowardly: every
> > weakness, every vice, has found its defenders, its rhetoric, its ennoblement
> > and exaltation, but stupidity hasn't. - Primo Levi
> > _______________________________________________
> > PyAR2 mailing list
> > PyAR2 at python.org
> > http://mail.python.org/mailman/listinfo/pyar2
> >
> >
>
>
> --
> Bob Fahr
> bob.fahr at gmail.com