[Tutor] New pet project - stripping down html content.

Adam Cripps kabads at gmail.com
Fri Jun 10 12:12:08 CEST 2005


I've been working through some of the early python challenges [1] and
feel enthused to scratch a current itch. However, I want to sound out
my idea for the itch before I start coding to get a perspective on the
direction I should take.

I've recently bought a media player that also displays .txt files. My
itch is to write a script that periodically goes to a news website and
'scrapes' all the relevant information from it. One of my favourites
would be the Guardian [2]. The Guardian provide RSS feeds and so I would
like to grab an RSS list and then proceed to download the content for
those 10 or so items. However, here's where the direction is needed.
Obviously, my preferred delivery is .txt without all the <html> tags.
Is there a quick and easy way to strip out html tags and be left with
just the content? And, to be even pickier, would it be possible
to strip out navigation content and be left with just the bare bones of
the story?

Any pointers for particular libraries I should be looking at would be
very helpful. I've already had a quick play with feedparser [3], which
was intuitive and easy to program with. What about stripping the html?
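
To make the tag-stripping question concrete, here is a minimal sketch of one
possible approach. It assumes Python 3 and the standard library's html.parser
module; the names TextExtractor and strip_tags are made up for illustration.
It only drops tags and script/style bodies, and it does not attempt the harder
job of separating navigation from the story text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keep text, drop tags and script/style bodies."""

    SKIP = {"script", "style"}                    # contents are never shown
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3"}  # act as word breaks

    def __init__(self):
        super().__init__(convert_charrefs=True)   # turn &amp; into & etc.
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Only keep text that is not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<p>Fish &amp; chips</p><p>Tea</p>"))
```

Each story fetched from the feed could be passed through strip_tags() before
being written to a .txt file. Picking out only the story body would still need
site-specific knowledge of the page layout.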

TIA 
Adam

[1] http://www.pythonchallenge.com
[2] http://www.guardian.co.uk
[3] http://feedparser.org/
-- 
http://www.monkeez.org
PGP key: 0x7111B833

