Extracting text from a Webpage using BeautifulSoup
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Tue May 27 06:54:05 EDT 2008
On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:
> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list with the most common
> words for the language under consideration. So, my code below reads
> the page -
>
> http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
>
> a Welsh-language page. I hope to then establish the 1000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
>
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'
Just extract the text from the body of the document.
    body_texts = soup.body(text=True)
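
To see the difference on something small, here is a minimal sketch. It assumes the newer BeautifulSoup 4 package (`bs4`) rather than the 2008-era module, spells the argument `string=True` (the current name for `text=True`), and uses a made-up `SAMPLE_HTML` string as a stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>Newyddion</title></head>
<body><h1>Croeso</h1><p>Bore da, Cymru.</p></body>
</html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every text node in the document, including the doctype declaration
# and the <title> text.
all_texts = soup.find_all(string=True)

# Only the text nodes below <body>. Calling a tag is shorthand for
# calling find_all() on it.
body_texts = soup.body(string=True)

# Drop whitespace-only nodes before looking at the result.
visible = [text.strip() for text in body_texts if text.strip()]
print(visible)
```

Restricting the search to `soup.body` keeps the doctype and anything in `<head>` out of the result without any extra filtering.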
> and -
>
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> Any suggestions how I might overcome this problem?
Ask the BBC to produce HTML that's less buggy. ;-)
http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
or closing tags without opening ones and so on.
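
Fragments like the `<a href=...` one usually come from JavaScript embedded in the page, so if the markup can't be fixed, one possible workaround is to skip text nodes that sit inside `<script>` or `<style>` and to skip comments. The following is only a sketch under those assumptions: it uses BeautifulSoup 4 (`bs4`), a made-up `SAMPLE_HTML`, and two hypothetical helpers, `visible_texts` and `count_words`, with a deliberately crude regular expression as the tokenizer.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed
from bs4.element import Comment

# Hypothetical stand-in for a page with inline JavaScript and a comment.
SAMPLE_HTML = """<html><body>
<script>document.write('<a href="' + url + '" class="sel">');</script>
<!-- ad slot -->
<p>Mae hi'n braf heddiw.</p>
<p>Mae hi'n oer.</p>
</body></html>"""

SKIPPED_PARENTS = {"script", "style"}


def visible_texts(soup):
    """Yield body text nodes that are not comments, scripts, or styles."""
    for node in soup.body.find_all(string=True):
        if isinstance(node, Comment):
            continue
        if node.parent.name in SKIPPED_PARENTS:
            continue
        yield node


def count_words(texts):
    """Count lower-cased words; apostrophes inside a word are kept."""
    counts = Counter()
    for text in texts:
        # \w is Unicode-aware for str patterns in Python 3.
        counts.update(re.findall(r"\w+(?:'\w+)*", text.lower()))
    return counts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
counts = count_words(visible_texts(soup))
print(counts.most_common(3))
```

For the real task you would feed the text from every page into the same `Counter` and finish with `counts.most_common(1000)`.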
Ciao,
Marc 'BlackJack' Rintsch