Scraping Wikipedia with Python
andreengels at gmail.com
Thu Aug 13 11:58:15 CEST 2009
On Tue, Aug 11, 2009 at 8:53 PM, David C Ullrich<dullrich at sprynet.com> wrote:
> Try reading a little there! Starting there I went to
> where I found a section on existing bots, comments on how the "scraping"
> is not what you want, and even a Python section with a link to something
> labelled PyWikipediaBot...
Some information on using the PyWikipediaBot for scraping from someone
who used to program on the bot (and occasionally still does):
To make the framework work, you need to add a file user-config.py with
the following contents:
    family = 'wikipedia'
    mylang = 'en'
If you also want to use the bot to edit pages on Wikipedia, you will
have to add:
    usernames['wikipedia']['en'] = '<the username of your bot>'
If you work on another language, use that language's abbreviation
instead of en.
The heart of the framework is the file wikipedia.py; you need to
import that one. It contains two important classes, Page and Site,
which represent a wikipedia page and the site as a whole, respectively.
It is best to put your code in a try like this:
    try:
        mysite = wikipedia.getSite()
        <your code here>
    finally:
        wikipedia.stopme()
The stopme() call has to do with the way the bot avoids over-feeding
the server with requests. The bot waits a certain time (default is
10 seconds) between two requests, but if you have several bots
running, it will lengthen this time. stopme() signals that the bot is
no longer running, so other runs are not delayed by it.
wikipedia.getSite() gets the site object for your default site (if the
settings above are chosen it is the English language Wikipedia).
Still with me? Good, because now we get into the real programming.
The Page class has as its __init__:
def __init__(self, site, title, insite=None, defaultNamespace=0):
site is here the wiki on which the page exists (usually this will be
mysite, which is why I defined it above), title the title of the page.
The optional parameters are for special usage.
The Page class has a number of methods, which you can find in the
file, but some of the most important are:
page.title() - the title of the page
page.site() - the wiki the page is on
page.get() - the (wiki) text of the page
page.put(text) - saves the page with 'text' as its new content. An
important optional parameter is 'comment', which specifies the summary
that is given with the change
page.exists() - a boolean, true if the page exists, false otherwise
page.linkedPages() - a list of Page objects, being the pages the page links to
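To give a feel for how these methods combine, here is a minimal sketch.
So that it runs without the framework installed, it uses a stand-in
class FakePage (hypothetical, not part of the framework) that only
mimics the methods listed above; summarize_links is likewise just an
illustrative name. With the real framework you would pass it a
wikipedia.Page instead.

```python
class FakePage(object):
    """Hypothetical stand-in mimicking a few methods of the framework's Page."""

    def __init__(self, title, text=None, links=()):
        self._title = title
        self._text = text            # None means "page does not exist"
        self._links = list(links)

    def title(self):
        return self._title

    def exists(self):
        return self._text is not None

    def get(self):
        return self._text

    def linkedPages(self):
        return self._links


def summarize_links(page):
    """Return (title, length of wiki text) for every existing linked page."""
    summary = []
    for linked in page.linkedPages():
        if linked.exists():          # skip links to missing pages
            summary.append((linked.title(), len(linked.get())))
    return summary


a = FakePage("A", "text of A")
missing = FakePage("Missing")
start = FakePage("Start", "[[A]] [[Missing]]", links=[a, missing])
print(summarize_links(start))
```

The exists() check matters here because a linked page need not exist
on the wiki, and there is then no text to work with.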
However, instead of page.get() it is advisable to use:
    wikipedia.getall(site, pages)
with 'site' being a Site object (e.g. mysite) and pages a list (or
more generally, iterable) of Page objects. It will get all pages in
the list using a single call to the wiki, thus speeding up your bot
and at the same time reducing its load on the wiki. Once a page has
been loaded (either through get or through getall), subsequent calls
to page.get() will not reload it. Thus, the normal way of working is
to create a list of pages one is interested in, use getall (in groups
of 60 or so) to load them, then use get to work with them.
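Splitting a long list into groups of 60 or so can be done with a small
helper. The following is only a sketch in plain Python; in_groups is a
name I made up for this example, it is not part of the framework, and
the strings stand in for Page objects.

```python
from itertools import islice


def in_groups(pages, size=60):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(pages)
    while True:
        group = list(islice(iterator, size))
        if not group:
            return
        yield group


titles = ["Page %d" % i for i in range(130)]
for group in in_groups(titles, 60):
    # With the real framework, each group would be a list of Page objects
    # and this is where the batch load for that group would go.
    print(len(group))
```

Because the helper accepts any iterable, it also works on a generator
of pages without first building the full list in memory.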
Another useful file in the framework is pagegenerators. It provides a
number of generators that yield Page objects. Some interesting ones
(check the code for the exact parameters):
AllpagesPageGenerator: generates all pages of the wiki, alphabetically
from a specified begin
ReferringPageGenerator: all pages linking to a given page
CategorizedPageGenerator: all pages in a given category
LinkedPageGenerator: all pages linked to from a given page
Other generators are used by 'wrapping them around' a given generator.
The most important of these is the PreloadingGenerator, which ensures
that the pages are preloaded (using wikipedia.getall) in groups.
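The idea of 'wrapping' one generator around another can be shown with
a toy model. This is not the source of PreloadingGenerator, just a
sketch of the principle: the names preloading, load_batch and
fake_load are hypothetical, and fake_load merely records how large
each batch was where a real bot would fetch the pages.

```python
def preloading(generator, load_batch, group_size=60):
    """Wrap `generator`: collect items in groups, load each group with one
    call to `load_batch`, then yield the items one by one."""
    group = []
    for item in generator:
        group.append(item)
        if len(group) == group_size:
            load_batch(group)
            for loaded in group:
                yield loaded
            group = []
    if group:                        # leftover items that did not fill a group
        load_batch(group)
        for loaded in group:
            yield loaded


calls = []                           # records the size of every batch load


def fake_load(group):
    calls.append(len(group))


result = list(preloading(iter(range(7)), fake_load, group_size=3))
print(result, calls)
```

The code consuming the wrapped generator still sees one item at a
time; only the number of load calls changes.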
A simple way to use the bot framework to scrape all pages of the
English Wikipedia (warning: This takes a few days!) would be:
    import wikipedia, pagegenerators

    basicgen = pagegenerators.AllpagesPageGenerator(includeredirects=False)
    generator = pagegenerators.PreloadingGenerator(basicgen, 200)
    for page in generator:
        title = page.title()
        text = page.get()
        <do whatever you want with title and text>
André Engels, andreengels at gmail.com