[PyAR2] Accessing Web Data

Bob Fahr bob.fahr at gmail.com
Thu Dec 20 20:42:25 CET 2007


Wayne,
One approach is to use the urllib library to fetch a page and then use
regular expressions to find the particular item you want on the page.  Here
is an example of getting a stock quote from Google; the stock symbol is
passed in as the argument:

import urllib
import re

def get_quote(symbol):
    base_url = 'http://finance.google.com/finance?q='
    content = urllib.urlopen(base_url + symbol).read()
    m = re.search('class="pr".*?>(.*?)<', content)
    if m:
        quote = m.group(1)
    else:
        quote = 'no quote available for: ' + symbol
    return quote

The regular expressions can get fairly complex depending on what information
you are trying to find.
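For instance, here is that same pattern run against a canned snippet, so you
can see what the capture group pulls out (the HTML below is made up for
illustration; the real Google Finance markup may differ):

```python
import re

# Made-up sample of the kind of markup the pattern targets
sample = '<span class="pr" id="ref_1">689.95</span>'

# Same pattern as in get_quote above: skip past class="pr" to the
# next '>' and capture everything up to the following '<'
m = re.search('class="pr".*?>(.*?)<', sample)
if m:
    print(m.group(1))  # 689.95
```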

Another approach is to use the httplib.HTTPConnection and HTMLParser
libraries.  You use HTTPConnection to create a connection to the website,
and then use the connection methods to fetch the data, the HTTP status, and
other info.  Once you have the data you can use the parser to parse the HTML
tags.  You'll have to override methods like handle_starttag, and the
overridden methods get called during parsing.  I use handle_starttag to
evaluate each start tag in the HTML and find all of the anchor tags
(links).  Here's a real short example:

import httplib
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.links = []
  def handle_starttag(self, tag, attributes):
    if tag == 'a':
      # look for the href attribute rather than assuming it comes first
      for name, value in attributes:
        if name == 'href' and value.endswith('.html'):
          self.links.append(value)

# in main
parser = MyHTMLParser()
connection = httplib.HTTPConnection('www.google.com')
connection.request('GET', '/some_search_url')
response = connection.getresponse()
data_length = response.getheader('content-length')
data = response.read()
parser.feed(data)

for link in parser.links:
  print link  # or do something else with each link
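If you want to try the parser without making a live connection, you can feed
it a string directly.  The HTML here is made up for illustration, and the
import fallback covers newer Pythons, where the module moved to html.parser:

```python
# Exercise the parser on a canned string instead of a live response
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3, where the module moved

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attributes):
        # attributes is a list of (name, value) tuples
        if tag == 'a':
            for name, value in attributes:
                if name == 'href' and value.endswith('.html'):
                    self.links.append(value)

parser = LinkParser()
parser.feed('<a href="/index.html">Home</a> <a href="/about.php">About</a>')
print(parser.links)  # ['/index.html']
```

Only the .html link is collected; the .php one is filtered out by the
endswith check.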

Hope this gets you started.

On Dec 20, 2007 1:06 PM, W W <srilyk at gmail.com> wrote:

> Hi,
>
> I'm a bit of a beginner at python, and I'm trying to figure out how to use
> python to retrieve webpages, and so far I'm unsuccessful in my attempts to
> find any information online.
>
> Basically what I'm wanting to do is write a program
> to search google/yahoo/etc. and return the site content so I can then search it.
>
>
> I'm sure (at least on linux) I could send a system command to wget the
> file, but that would severely limit cross-platform use, and worse I'm sure
> it's not very secure.
>
> Any pointers on how to do it or where to find the information would be
> appreciated, thanks!
> -Wayne
>
> --
> To be considered stupid and to be told so is more painful than being
> called gluttonous, mendacious, violent, lascivious, lazy, cowardly: every
> weakness, every vice, has found its defenders, its rhetoric, its ennoblement
> and exaltation, but stupidity hasn't. - Primo Levi
> _______________________________________________
> PyAR2 mailing list
> PyAR2 at python.org
> http://mail.python.org/mailman/listinfo/pyar2
>
>


-- 
Bob Fahr
bob.fahr at gmail.com

