[Tutor] How to read websites - Web Scraping or Parsing in python

Wed Jun 13 12:09:36 CEST 2012

On 13/06/12 10:06, Surya K wrote:

> As my target webpage could be anyone of web, and each website's
> designers could have designed in their own fashion using different
> "class names", I am unable to figure out how to read "article content"
> in a webpage.

This is always the problem with scraping web pages: you are dependent
on how the individual author structures their pages, and if they change
the format it will likely break your scraper. Some web sites also
deliberately make their pages hard to scrape, for example by changing
the div/class names arbitrarily. This is to encourage you to use their
web site and see the beautiful adverts they have on display, which pay
for the service.

> So, can anyone tell me what libraries should I ultimately use to achieve
> it ?? and what elements and attributes I should read??

The most basic are urllib, urllib2 and httplib in the standard library.
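
As a rough illustration of the fetching step, here is a minimal sketch.
It assumes Python 3, where urllib2 and httplib were reorganised into
urllib.request and http.client. The helper name fetch_html, the
User-Agent string and the 10 second timeout are my own choices for the
example, not anything from the original question.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    # Some servers reject requests without a User-Agent; this value is
    # an arbitrary placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "tutor-example/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8
        # (an assumption) when none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# html = fetch_html("http://www.alan-g.me.uk/")
```

This only gets you the raw HTML as a string; picking the article text
out of it is a separate step.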

> I thought of using BeautifulSoup but I really don't know which elements
> ( div's or p or a ) I should read.

BS is good, but you will need to know which tags you are interested in.
Usually the low-level <p> tags are inside a <div>, so you can locate the
<div> and fetch only the <p>s from that section. But you will need to
do some digging and probably some trial and error. My advice is to put
the code in a module/class and experiment with it at the >>> prompt.
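
To make the "locate the <div>, then fetch only its <p>s" idea concrete,
here is a small sketch. Note that it uses the standard library's
html.parser rather than BeautifulSoup, purely so that it runs without
installing anything; the same approach should carry over to
BeautifulSoup. The class name DivParagraphs, the "article" class name
and the sample HTML are all made up for the example; a real page will
need its own digging to find the right <div>.

```python
from html.parser import HTMLParser


class DivParagraphs(HTMLParser):
    """Collect the text of <p> tags found inside <div class="...">."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.div_depth = 0    # > 0 while inside the target <div>
        self.in_p = False
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested <div>s too, so we know when the target closes.
            if self.div_depth or self.target_class in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            self.paragraphs.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)


# Made-up page: only the <p>s inside the "article" <div> are wanted.
SAMPLE = """
<div class="nav"><p>Home</p></div>
<div class="article main">
  <p>First paragraph.</p>
  <div class="inner"><p>Nested <b>bold</b> text.</p></div>
</div>
<p>Footer</p>
"""

parser = DivParagraphs("article")
parser.feed(SAMPLE)
parser.close()
print(parser.paragraphs)
# ['First paragraph.', 'Nested bold text.']
```

Keeping this in a module means you can import it at the >>> prompt and
try different class names against a saved copy of the page until the
output looks right.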

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/