[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

Alexander Harrowell a.harrowell at gmail.com
Fri Feb 8 00:45:15 CET 2008


On Thu, Feb 7, 2008 at 7:11 PM, Shaun Laughey <shaun at laughey.com> wrote:

>
> Hi,
> I have used Beautiful Soup for parsing HTML.
> It works very nicely and I didn't see much of an issue with speed when
> parsing several hundred HTML files every hour or so.
> I also rolled my own using various regexes and stuff nicked from a
> Perl lib. It was awful and feature-incomplete. Beautiful Soup worked
> better.
>
> Shaun Laughey.
>

To clarify, I use BeautifulSoup for a small project that parses frequently
changing HTML on a number of websites (>1MB each), extracts the content of
specific tags, filters out certain strings from the content, and serves it
up in a consistent format. The input HTML comes from the wild, and often
contains odd tags, funny characters, and other inconsistencies.
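That extract-and-filter step can be sketched in a few lines. This is only an illustration, assuming the modern bs4 package (pip install beautifulsoup4); the tag name, the filter strings, and the sample page are made up for the example, not taken from my actual project.

```python
# Minimal sketch: pull the text of specific tags out of messy HTML,
# then filter out unwanted strings. All names here are illustrative.
from bs4 import BeautifulSoup

UNWANTED = ("Advertisement", "Sponsored")  # strings to filter out


def extract_headlines(html):
    soup = BeautifulSoup(html, "html.parser")
    # get_text(strip=True) tolerates nested tags and stray whitespace
    texts = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
    return [t for t in texts if t and not any(u in t for u in UNWANTED)]


page = """<html><body>
  <h2>Real story</h2>
  <h2>Advertisement: buy now</h2>
  <h2> Another story </h2>
</body></html>"""
print(extract_headlines(page))  # ['Real story', 'Another story']
```

The html.parser backend is forgiving of the odd tags and funny characters you get from HTML in the wild, which is much of the appeal.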

It has worked near-perfectly for the last nine months. Speed is the commonly
cited complaint about BS, which is why I mentioned it, but when I profiled
the code in an effort to speed it up I found that over 90% of the time was
spent on network latency fetching the data from the remote sites, not on
parsing.
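On the subject line's actual question, whitelisting tags is also short with BeautifulSoup. Again a hedged sketch assuming bs4: the allowed tag and attribute sets are illustrative, and you would tune them to your own policy.

```python
# Tag-whitelisting sketch with bs4: drop script/style wholesale,
# unwrap any other non-whitelisted tag (keeping its text), and strip
# non-whitelisted attributes. The whitelists are illustrative.
from bs4 import BeautifulSoup

ALLOWED = {"p", "a", "em", "strong"}  # tags to keep
ALLOWED_ATTRS = {"a": {"href"}}       # per-tag attribute whitelist


def scrub(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and its contents entirely
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            tag.unwrap()  # keep the text, lose the tag itself
        else:
            tag.attrs = {k: v for k, v in tag.attrs.items()
                         if k in ALLOWED_ATTRS.get(tag.name, set())}
    return str(soup)


print(scrub('<div><p onclick="x">Hi <script>evil()</script>'
            '<em>there</em></p></div>'))  # <p>Hi <em>there</em></p>
```

Note this is a sketch, not a security boundary; for untrusted input you would also want to vet attribute values (javascript: hrefs and the like).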

Alex
