[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?
andy at reportlab.com
Fri Feb 8 10:01:06 CET 2008
On 07/02/2008, Alexander Harrowell <a.harrowell at gmail.com> wrote:
> To clarify, I use BeautifulSoup for a small project that parses frequently
> changing HTML on a number of websites (>1MB each), extracts the content of
> specific tags, filters out certain strings from the content, and serves it
> up in a consistent format. The input HTML comes from the wild, and often
> contains odd tags, funny characters, and other inconsistencies.
> It has so far worked near-perfectly for the last 9 months. Speed appears to
> be a conventional problem with BS, which is why I mentioned it, but when I
> analysed the code in an effort to speed it up I discovered that 90%+ of the
> time taken was accounted for by network latency in getting the data from the
> remote sites.
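(On the tag-whitelisting in the subject line: here is a minimal stdlib-only sketch using `html.parser` rather than BeautifulSoup; the allowed tag set and the sample input are made up for illustration. It keeps a whitelist of tags, drops everything else, and also discards the *contents* of script/style blocks:)

```python
from html.parser import HTMLParser

ALLOWED = {"p", "b", "i", "em", "strong", "a"}  # tags to keep (illustrative)
SKIP = {"script", "style"}                      # drop these tags AND their contents

class TagWhitelister(HTMLParser):
    """Keep only whitelisted tags; pass text through, drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in ALLOWED:
            # Attributes are dropped wholesale; a real scrubber also
            # needs an attribute whitelist (href, src, ...).
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside a skipped tag (e.g. a <script> body) is discarded.
        if not self.skip_depth:
            self.out.append(data)

    def scrub(self, html):
        self.feed(html)
        return "".join(self.out)

dirty = '<p onclick="x()">Hi <script>evil()</script><b>there</b></p>'
print(TagWhitelister().scrub(dirty))  # <p>Hi <b>there</b></p>
```

This tolerates the unbalanced tags common in wild HTML because `HTMLParser` just streams events; it never builds a tree, so a missing close tag can't break it.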
FWIW, we parse tens of thousands of pages every week to let people
republish content into nice PDFs. Beautiful Soup was the only
thing that made this sane, as many pages are not structured to be easy
to parse. Like you, we found the network was the limiting factor, and simply
kicking off several scraping processes in parallel solved that (e.g.
one run of a script parses hotels from A-F, the next from G-M, and so
on...). I can't imagine using anything else.
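(That split-the-alphabet approach can be sketched in a few lines. This uses threads rather than separate processes, which is fine here since the bottleneck is network I/O; the URLs and the `fetch` stub are hypothetical placeholders for the real fetching code:)

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real network fetch (urllib, etc.);
    # hypothetical: just echoes the URL it was given.
    return f"fetched {url}"

def scrape_range(letters):
    # Hypothetical index pages, one per initial letter (e.g. hotels A-F).
    return [fetch(f"http://example.com/hotels/{c}") for c in letters]

# Partition the alphabet and scrape each chunk concurrently,
# mirroring the "one run does A-F, the next G-M" scheme above.
chunks = ["ABCDEF", "GHIJKLM", "NOPQRS", "TUVWXYZ"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scrape_range, chunks))

pages = [page for chunk in results for page in chunk]
print(len(pages))  # 26
```

Since each chunk is independent, the same partitioning works unchanged with `multiprocessing` or with entirely separate script invocations, as described above.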
ReportLab Europe Ltd.
165 The Broadway, Wimbledon, London SW19 1NE, UK
More information about the python-uk mailing list