[Tutor] How to read websites - Web Scraping or Parsing in python
Alan Gauld
alan.gauld at btinternet.com
Wed Jun 13 12:09:36 CEST 2012
On 13/06/12 10:06, Surya K wrote:
> As my target webpage could be anyone of web, and each website's
> designers could have designed in their own fashion using different
> "class names", I am unable to figure out how to read "article content"
> in a webpage.
This is always the problem with scraping web pages: you are dependent on
how the individual author structures their pages, and if they change the
format it will likely break your scraper. Also, some web sites implement
devices to deliberately make it hard to scrape the pages, such as
changing the div/class names arbitrarily. This is to encourage you to
use their web site and see the beautiful adverts they have on display
and that pay for the service.
> So, can anyone tell me what libraries should I ultimately use to achieve
> it ?? and what elements and attributes I should read??
The most basic are urllib, urllib2 and httplib in the standard library;
they fetch the raw page but leave the parsing to you.
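
As a rough sketch of fetching a page with the standard library alone:
on Python 3 the urllib2 functionality lives in urllib.request, so the
example uses that module. The name fetch_html, the User-Agent string
and the timeout value are illustrative assumptions, not anything from
the original question.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    # Some sites reject the default Python user agent, so send a generic one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header; assume UTF-8 if absent.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes rather than failing on a mislabelled page.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.alan-g.me.uk/") should hand back the page
source as one string, ready to pass to whatever parser you choose.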
> I thought of using BeautifulSoup but I really don't know which elements
> ( div's or p or a ) I should read.
BS is good but you will need to know which tags you are interested in.
Usually the low level <p> tags are inside a <div>, so you can locate the
<div> and only fetch the <p>'s from that section. But you will need to
do some digging and probably some trial and error. My advice is to put
the code in a module/class and use the >>> prompt to experiment.
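
The "find the <div>, then take its <p>'s" idea might look something like
the sketch below. It assumes the bs4 package (BeautifulSoup 4) is
installed, and the class name "article", the sample HTML and the
function name paragraphs_in_div are all made up for illustration; a
real page will need its own class name, found by inspecting the source.

```python
from bs4 import BeautifulSoup

# Invented sample page: one advert block and one article block.
SAMPLE_HTML = """
<html><body>
  <div class="sidebar"><p>Advert</p></div>
  <div class="article">
    <p>First paragraph.</p>
    <p>Second <b>paragraph</b>.</p>
  </div>
</body></html>
"""


def paragraphs_in_div(html, div_class):
    """Return the text of each <p> inside the first <div> with div_class."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the enclosing <div> first ...
    container = soup.find("div", class_=div_class)
    if container is None:
        # The page does not use this class name (or has changed format).
        return []
    # ... then fetch only the <p> tags from that section.
    return [p.get_text().strip() for p in container.find_all("p")]
```

Importing this as a module at the >>> prompt and trying different class
names against a saved copy of the page is a cheap way to do the trial
and error.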
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/