[Tutor] Web crawling!
vinces1979 at gmail.com
Wed Jul 29 20:53:32 CEST 2009
On Wed, Jul 29, 2009 at 9:59 AM, Raj Medhekar <cosmicsand27 at yahoo.com>wrote:
> Does anyone know a good webcrawler that could be used in tandem with the
> Beautiful soup parser to parse out specific elements from news sites like
> BBC and CNN? Thanks!
> Tutor maillist - Tutor at python.org
I have used httplib2 (http://code.google.com/p/httplib2/) to crawl sites (with
auth/cookie support) and lxml (HTML XPath) to parse out links.
If you don't need auth or cookies, the built-in urllib2 module is enough to
request pages. Here is a simple example:
import urllib2
from lxml import html

page = urllib2.urlopen("http://this.page.com")
data = html.fromstring(page.read())
all_links = data.xpath("//a")  # all <a> elements on the page
for link in all_links:
    print link.get("href")  # each link's href attribute, if present
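If you'd rather avoid third-party packages entirely, the same link extraction
can be sketched with only the standard library's HTMLParser (html.parser in
Python 3). This runs against an inline sample string rather than a fetched
page, and the URL in it is just a placeholder:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Placeholder HTML standing in for a downloaded page
sample = '<html><body><a href="http://example.com/news">News</a></body></html>'
parser = LinkExtractor()
parser.feed(sample)
print(parser.links)
```

You would feed it the page text from urllib2/urllib.request instead of the
sample string; it is slower and less forgiving than lxml or Beautiful Soup on
messy real-world HTML, but has no dependencies.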