[PyAR2] Accessing Web Data
pyar2 at cowsgomoo.org
Thu Dec 20 23:06:16 CET 2007
urllib.urlopen() is one of those things that works, but isn't, by default,
nice to server admins. Dive Into Python has a brief example using urlopen()
to fetch pages, as well as a piece on processing HTML.
Dive Into Python seems to be aimed, generally speaking, at people with some
sort of programming background .. so you may have to wrap your mind around it
for a few minutes before you get it.
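One easy way to be nicer to admins is to identify your script with a descriptive User-Agent header instead of the library default. A minimal sketch (the guarded import covers both urllib2 on 2.x and urllib.request on newer Pythons; the agent string and contact address are made up):

```python
try:
    from urllib.request import Request   # newer Pythons
except ImportError:
    from urllib2 import Request          # Python 2.x

def polite_request(url):
    # Identify the script so server admins can see who is fetching;
    # the name and contact address here are only placeholders.
    headers = {'User-Agent': 'my-fetcher/0.1 (contact: me at example.org)'}
    return Request(url, headers=headers)

req = polite_request('http://www.example.com/')
```

Pass the resulting Request object to urlopen() instead of a bare URL string and the header goes out with every fetch.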
On Dec 20, 2007 1:42 PM, Bob Fahr <bob.fahr at gmail.com> wrote:
> One approach is to use the urllib library to fetch a page and then use
> regular expressions to find the particular item you want on the page. Here
> is an example of getting a stock quote from google, the stock symbol is
> passed in as the argument:
> import urllib
> import re
>
> def get_quote(symbol):
>     base_url = 'http://finance.google.com/finance?q='
>     content = urllib.urlopen(base_url + symbol).read()
>     m = re.search('class="pr".*?>(.*?)<', content)
>     if m:
>         quote = m.group(1)
>     else:
>         quote = 'no quote available for: ' + symbol
>     return quote
> The regular expressions can get fairly complex depending on what
> information you are trying to find.
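To get a feel for how this kind of pattern behaves, here is the same regular expression run against a canned snippet of HTML, with no network involved (the markup is made up for illustration; Google's real pages differ and change often, which is exactly why scraping regexes are fragile):

```python
import re

# A made-up fragment resembling the quote markup the pattern expects.
content = '<span class="pr" id="ref_123">42.50</span>'

# Lazily skip the rest of the tag, then capture everything up to
# the next '<' -- i.e. the text inside the element.
m = re.search('class="pr".*?>(.*?)<', content)
price = m.group(1) if m else 'no match'
```

If the site renames the class or nests another tag inside, the pattern silently stops matching, so always keep the fallback branch for a failed search.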
> Another approach is to use the httplib.HTTPConnection and HTMLParser
> libraries. You use HTTPConnection to create a connection to the website,
> and then use the connection methods to fetch data and HTTP status and other
> info. Once you have the data you can use the parser to parse the HTML
> tags. You'll have to overload methods like handle_starttag and then the
> overloaded methods get called during parsing. I use handle_starttag to
> evaluate each start tag in the HTML and find all of the anchor tags
> (links). Here's a real short example:
> import httplib
> from HTMLParser import HTMLParser
>
> class MyHTMLParser(HTMLParser):
>     def __init__(self):
>         HTMLParser.__init__(self)
>         self.links = []
>
>     def handle_starttag(self, tag, attributes):
>         # attributes is a list of (name, value) pairs
>         if tag == 'a':
>             for name, value in attributes:
>                 if name == 'href' and value.endswith('.html'):
>                     self.links.append(value)
>
> # in main
> parser = MyHTMLParser()
> connection = httplib.HTTPConnection('www.google.com')
> connection.request('GET', '/some_search_url')
> response = connection.getresponse()
> data_length = response.getheader('content-length')
> data = response.read()
> parser.feed(data)
> for link in parser.links:
>     # do something with each link
>     print link
> Hope this gets you started.
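To see the parser callbacks in action without touching the network, here is a self-contained sketch that feeds a canned page to a link-collecting subclass (the guarded import covers both the old HTMLParser module and the html.parser location it moved to in newer Pythons; the page markup is made up):

```python
try:
    from html.parser import HTMLParser   # newer Pythons
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.x

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags that end in .html."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attributes):
        # attributes arrives as a list of (name, value) pairs
        if tag == 'a':
            for name, value in attributes:
                if name == 'href' and value.endswith('.html'):
                    self.links.append(value)

# Feed a canned page instead of a live connection, so the parsing
# step can be seen in isolation.
page = ('<html><body>'
        '<a href="first.html">one</a>'
        '<a href="/images/pic.png">two</a>'
        '<a href="second.html">three</a>'
        '</body></html>')

collector = LinkCollector()
collector.feed(page)
```

feed() drives the callbacks: each start tag triggers handle_starttag, so after the call collector.links holds just the .html links.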
> > On Dec 20, 2007 1:06 PM, W W <srilyk at gmail.com> wrote:
> > Hi,
> > I'm a bit of a beginner at python, and I'm trying to figure out how to
> > use python to retrieve webpages, and so far I'm unsuccessful in my attempts
> > to find any information online.
> > Basically what I'm wanting to do is write a program
> > to search google/yahoo/etc. and return the site content so I can then search it.
> > I'm sure (at least on linux) I could send a system command to wget the
> > file, but that would severely limit cross-platform use, and worse I'm sure
> > it's not very secure.
> > Any pointers on how to do it or where to find the information would be
> > appreciated, thanks!
> > -Wayne
> > --
> > To be considered stupid and to be told so is more painful than being
> > called gluttonous, mendacious, violent, lascivious, lazy, cowardly: every
> > weakness, every vice, has found its defenders, its rhetoric, its ennoblement
> > and exaltation, but stupidity hasn't. - Primo Levi
> > _______________________________________________
> > PyAR2 mailing list
> > PyAR2 at python.org
> > http://mail.python.org/mailman/listinfo/pyar2
> Bob Fahr
> bob.fahr at gmail.com