[Tutor] How to read websites - Web Scraping or Parsing in python
Surya K
suryak at live.com
Wed Jun 13 11:06:10 CEST 2012
Hi,
I am trying to write a Python program that reads the content of any web page. Taking a blog as an example, I'd like to read everything the author has written on it.
Each blog or site has a different HTML/XML structure, depending on whether it runs on Blogger, WordPress, TypePad, or something else, so I thought of using their RSS/Atom feeds to extract the content.
I then used Universal Feed Parser to extract the content:
    import feedparser

    url = "http://knolzone.com/feed"
    parsedFeed = feedparser.parse(url)

    websiteTitle = parsedFeed.feed.title
    for aArticle in parsedFeed.entries:
        # Each entry carries the article's URL and a short summary.
        print(aArticle.link)
        print(aArticle.summary)
I was able to get the links and summaries of all the articles, along with the website's title. But I'd like to read the whole content of a particular article, not just its summary.
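Some feeds do ship the full article body alongside the summary (Atom <content>, or RSS <content:encoded>), and as far as I understand feedparser exposes it as a list of dictionaries under the entry's "content" key, each with a "value". A minimal sketch under that assumption follows; full_text is a hypothetical helper name, and it falls back to the summary when the feed has no full body:

```python
def full_text(entry):
    """Return the full body of a feed entry if present, else its summary.

    `entry` is assumed to behave like a dict, which feedparser entries do.
    """
    # "content" is a list because a feed may offer several renditions
    # (for example HTML and plain text); take the first non-empty one.
    for part in entry.get("content", []):
        value = part.get("value")
        if value:
            return value
    # Many feeds only publish an excerpt, so fall back to it.
    return entry.get("summary", "")


# Possible usage with the feed parsed earlier (not run here):
#     for aArticle in parsedFeed.entries:
#         print(full_text(aArticle))
```

Whether this helps depends entirely on the feed: if the publisher only syndicates excerpts, the full text still has to come from the article page itself.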
Take, for example, the web page http://knolzone.com/unlock-hidden-themes-in-windows-7-and-other-useful-tips-part-5-of-7/. The author has written an article on it, and I'd like to read just that portion of the page.
As my target could be any page on the web, and each site's designers lay out their pages in their own way with different class names, I can't figure out how to locate the article content in a page.
So, can anyone tell me which libraries I should ultimately use to achieve this, and which elements and attributes I should read?
I thought of using BeautifulSoup, but I really don't know which elements (div, p, or a) I should read. The page above consists of lots of "p" elements, and fortunately all the article content is in "p" elements. However, there are a few other "p" elements that don't belong to the article content. How should I eliminate them?
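One rough heuristic for dropping the stray "p" elements, sketched here under the assumption that the article body is the container holding the most paragraph text: group the "p" elements by their direct parent and keep only the largest group. The sketch uses Python 3's standard html.parser so it has no dependencies (the same idea could be written with BeautifulSoup); ParagraphCollector and guess_article_paragraphs are made-up names, and the heuristic will misfire on pages where comments or sidebars outweigh the article.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ParagraphCollector(HTMLParser):
    """Group the text of <p> elements by the element directly containing them."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (element id, tag name) for every open element
        self.next_id = 0
        self.groups = {}    # container id -> list of paragraph texts
        self.pieces = None  # text fragments of the open <p>, or None

    def finish_paragraph(self):
        text = " ".join("".join(self.pieces).split())  # collapse whitespace
        self.pieces = None
        # Drop the <p> and anything left open inside it.
        while self.stack and self.stack.pop()[1] != "p":
            pass
        container = self.stack[-1][0] if self.stack else None
        if text:
            self.groups.setdefault(container, []).append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.pieces is not None:
            self.finish_paragraph()  # a new <p> implicitly closes the old one
        if tag in VOID:
            return
        self.next_id += 1
        self.stack.append((self.next_id, tag))
        if tag == "p":
            self.pieces = []

    def handle_endtag(self, tag):
        if tag == "p":
            if self.pieces is not None:
                self.finish_paragraph()
            return
        open_tags = [name for _, name in self.stack]
        if tag not in open_tags:
            return  # stray closing tag, ignore it
        index = len(open_tags) - 1 - open_tags[::-1].index(tag)
        if self.pieces is not None and "p" in open_tags[index:]:
            self.finish_paragraph()  # container closed with its <p> still open
        del self.stack[index:]

    def handle_data(self, data):
        if self.pieces is not None:
            self.pieces.append(data)


def guess_article_paragraphs(html):
    """Return the paragraphs of the container with the most paragraph text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if parser.pieces is not None:
        parser.finish_paragraph()
    if not parser.groups:
        return []
    return max(parser.groups.values(),
               key=lambda paragraphs: sum(len(p) for p in paragraphs))
```

Given the HTML of an article page as a string, guess_article_paragraphs(html) returns a list of paragraph strings from the single densest container; paragraphs in sidebars or footers live under other parents and are left out.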
Thanks for reading. I hope you can help.