HTML Parsing and Indexing

Andy Dingley dingbat at
Tue Nov 14 00:12:15 CET 2006

mailtogops at wrote:

> I am involved in one project which tends to collect news
> information published on selected, known web sites in the format of
> HTML, RSS, etc.

I just can't imagine why anyone would still want to do this.

With RSS, it's an easy (if not trivial) problem.
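
To give a sense of how little is involved in the RSS case, here is a minimal sketch using only the standard library's xml.etree.ElementTree. The sample feed, its URLs, and the function name parse_rss_items are made up for illustration. It assumes a well-formed RSS 2.0 document. Real feeds vary (Atom, namespaces, broken XML), so treat it as a starting point rather than a complete reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 13 Nov 2006 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
      <pubDate>Mon, 13 Nov 2006 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict (title, link, pubDate) per RSS 2.0 <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["pubDate"], entry["title"], entry["link"])
```

The feed format names the fields for you, so the code never has to guess where a headline starts or ends.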

With HTML it's hard and it's unstable, and the legality of recycling
others' content like this is far from clear.  Are you _sure_ there's
still a need to do this thoroughly awkward task?  How many sites are
there that are worth scraping, permit scraping, and don't yet offer RSS?
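
For contrast, a rough sketch of the HTML case, using the standard library's html.parser. Everything site-specific here is an assumption invented for the example: the sample markup, the idea that headlines live in h2 elements, and the class name "headline". That is exactly the unstable part, since the scraper silently returns nothing as soon as the site changes its markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the tag and class names are assumptions.
SAMPLE_HTML = """
<html><body>
  <div class="story"><h2 class="headline">First headline</h2><p>Body</p></div>
  <div class="story"><h2 class="headline">Second headline</h2></div>
</body></html>
"""


class HeadlineScraper(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "headline") in attrs:
            self._in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline:
            self.headlines[-1] += data


def scrape_headlines(html_text):
    """Return the stripped headline strings found in html_text."""
    parser = HeadlineScraper()
    parser.feed(html_text)
    parser.close()
    return [h.strip() for h in parser.headlines]


if __name__ == "__main__":
    print(scrape_headlines(SAMPLE_HTML))
    # A trivial redesign (renaming the class) breaks the scraper silently.
    print(scrape_headlines(SAMPLE_HTML.replace('class="headline"', 'class="title"')))
```

Every site needs its own rules like these, and each set has to be maintained by hand whenever the page layout changes.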
