[Tutor] python and Beautiful soup question

Steven D'Aprano steve at pearwood.info
Tue Jun 23 02:44:35 CEST 2015


On Mon, Jun 22, 2015 at 12:11:30PM +0200, Timo wrote:
> On 21-06-15 at 22:04, Joshua Valdez wrote:
> >I'm having trouble making this script work to scrape information from a
> >series of Wikipedia articles.
> >
> >What I'm trying to do is iterate over a series of wiki URLs and pull out
> >the page links on a wiki portal category (e.g.
> >https://en.wikipedia.org/wiki/Category:Electronic_design).
> Instead of scraping the webpage, I'd have a look at the API. This might 
> give much better and more reliable results than relying on parsing HTML.
> 
> https://www.mediawiki.org/wiki/API:Main_page

Seconded, thirded and fourthed!

Please don't scrape Wikipedia. It is hard enough for them to deal with 
bandwidth requirements and remain responsive for browsers without 
badly written bots trying to suck down pieces of the site. Use the API.

Not only is it the polite thing to do, but it protects you too: 
Wikipedia is entitled to block your bot if they think it is not 
following the rules.
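
As a rough starting point, something like the following should list the 
pages in a category through the API rather than by parsing HTML. It is 
only a sketch: the function names and the User-Agent string are made up 
for illustration, and I am assuming Python 3 and the standard 
"categorymembers" query module. Check the parameters in the API 
documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical identifier: replace with your own bot name and contact details.
USER_AGENT = "TutorExampleBot/0.1 (you@example.com)"


def build_url(params):
    """Return the full API URL for a dict of query parameters."""
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download one API response and decode it from JSON."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def category_members(category, fetch=fetch_json):
    """Yield the titles of the pages in a category, one batch at a time."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page",      # skip subcategories and files
        "cmlimit": "max",      # let the server pick the largest allowed batch
        "format": "json",
        "continue": "",
    }
    while True:
        data = fetch(build_url(params))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Copy the server's continuation tokens into the next request.
        params.update(data["continue"])
```

Looping over category_members("Category:Electronic_design") should then 
give you the page titles one at a time. The fetch argument is only there 
so that the paging logic can be tried out without touching the network.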

> You can try out the huge number of different options (with short 
> descriptions) on the sandbox page:
> 
> https://en.wikipedia.org/wiki/Special:ApiSandbox



-- 
Steve

