Extracting text from a Webpage using BeautifulSoup

Tue May 27 06:54:05 EDT 2008

On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:

> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list with the most common
> words for the language under consideration. So, my code below reads
> the page -
> 
> http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
> 
> a Welsh-language page. I hope to then establish the 1000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
> 
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'

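Just extract the text from the body of the document. Calling a tag is
shorthand for findAll() on that tag, so this returns only the text nodes
inside <body> and leaves out the doctype: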

body_texts = soup.body(text=True)
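
Once the text nodes are in a list such as body_texts, the counting step
the original poster describes could be sketched roughly as follows. This
is only an illustration under a few assumptions: it targets Python 3, the
names WORD_RE and most_common_words are made up for this example, and the
regular expression is one possible definition of a "word" (runs of
letters, optionally joined by an apostrophe or hyphen), not a proper
Welsh tokenizer.

```python
import re
from collections import Counter

# Hypothetical word pattern: one or more Unicode letters, optionally
# joined by an apostrophe or hyphen (so "hi'n" stays one word).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def most_common_words(texts, limit=1000):
    """Return the `limit` most frequent lowercased words in `texts`.

    `texts` is any iterable of strings, for example the list of text
    nodes returned by soup.body(text=True).
    """
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts.most_common(limit)
```

To cover a set of pages, the text nodes of each page can be passed
through the same Counter, for example by concatenating the lists before
calling most_common_words(all_texts, 1000).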

> and -
> 
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
> 
> Any suggestions how I might overcome this problem?

Ask the BBC to produce HTML that's less buggy.  ;-)

http://validator.w3.org/ reports errors such as "'body' tag not allowed
here", closing tags without matching opening tags, and so on.
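
The second fragment quoted above looks like JavaScript string
concatenation, so it probably comes from an inline script rather than
from visible text; that is a guess based on the snippet, not something
checked against the page. One way to keep such content out of the word
list is to skip everything inside <script> and <style> elements. The
sketch below uses the standard library's html.parser instead of
BeautifulSoup so that it runs without third-party packages; it assumes
Python 3, and VisibleTextParser and visible_texts are hypothetical names
chosen for this example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Doctype declarations and comments never arrive here; they are
        # routed to handle_decl() and handle_comment() instead.
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self.texts.append(stripped)


def visible_texts(html):
    """Return the non-empty text nodes of `html` outside script/style."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts
```

The same idea carries over to a BeautifulSoup tree: drop any text node
whose enclosing element is a script or style tag before counting words.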

Ciao,
	Marc 'BlackJack' Rintsch