[Tutor] extract plain english words from html

Sat Oct 15 01:51:57 CEST 2005

On Fri, 14 Oct 2005, Marc Buehler wrote:

> i have a ton of html files from which i want to extract the plain
> english words, and then write those words into a single text file.

Hi Marc,

The BeautifulSoup parser should be able to do what you want:

    http://www.crummy.com/software/BeautifulSoup/

For example:

######
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> import re
>>> soup = BeautifulSoup(urllib.urlopen('http://python.org/index.html'))
>>> for chunk in soup.fetch('a'):
...     print chunk.fetchText(re.compile('.+'))
...
[]
['Search']
['Download']
['Documentation']
['Help']
['Developers']
['Community']
['SIGs']
['What is Python?']
['Python FAQs']
['Python 2.4']
['(docs)']
['Python 2.3']
['(docs)']
['Python 2.2']
['(docs)']
['MacPython']
['Jython']
... [lots of output here]
######

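This quickly gets all the hyperlinked text off the python.org front page:
soup.fetch('a') finds every anchor tag, and fetchText() with the '.+'
pattern pulls out the text strings inside each one.  Note that it only
collects link text, not every word on the page.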
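
If you want every word in a pile of files rather than just the link text,
here is a rough sketch of one way to go about it.  It does not use
BeautifulSoup at all; it assumes Python 3 and the standard library's
html.parser module.  The names TextExtractor, extract_words and write_words
are made up for this example, and the regular expression is only a crude
guess at what counts as an "English word", so adjust it to taste.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the alphabetic words found in the visible text of `html`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps text from adjacent tags from running
    # together, at the cost of splitting a word that is broken up by
    # inline markup.
    text = " ".join(parser.chunks)
    # Assumption: a "word" is a run of ASCII letters, optionally with
    # internal apostrophes (as in "it's").
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def write_words(html_paths, out_path):
    """Write the words from every file in `html_paths`, one per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in html_paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                for word in extract_words(f.read()):
                    out.write(word + "\n")
```

Something like write_words(glob.glob('*.html'), 'words.txt') would then
gather everything into a single file (after an import glob; both file names
there are placeholders).  It will not filter out non-English words or cope
with badly broken markup as gracefully as a forgiving parser does.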

BeautifulSoup isn't perfect, but it deals surprisingly well with ugly
HTML.  Hope this helps!