html5lib not thread safe. Is the Python SAX library thread-safe?
nagle at animats.com
Mon Mar 12 05:48:19 CET 2012
On 3/11/2012 2:45 PM, Cameron Simpson wrote:
> On 11Mar2012 13:30, John Nagle <nagle at animats.com> wrote:
> | "html5lib" is apparently not thread safe.
> | (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> | Looking at the code, I've only found about three problems.
> | They're all the usual "cached in a global without locking" bug.
> | A few locks would fix that.
> | But html5lib calls the XML SAX parser. Is that thread-safe?
> | Or is there more trouble down at the bottom?
> | (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> | which is thread safe, although dated. I'm looking at converting to
> | html5lib.)
> IIRC, BeautifulSoup4 may do that for you:
> "Beautiful Soup 4 uses html.parser by default, but you can plug in
> lxml or html5lib and use that instead."
I want bad HTML parsed the way the HTML5 standard specifies. (HTML5
formally defines how to parse malformed comments, for example.) I
currently have a modified version of BeautifulSoup that's more robust
than the standard one, but it doesn't handle errors the way browsers do.
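
For readers unfamiliar with the "cached in a global without locking"
bug mentioned in the quoted text, here is a minimal sketch of the
pattern and the lock-based fix. It is not html5lib's actual code; the
names _cache, _cache_lock, and get_cached are hypothetical and exist
only for illustration.

```python
import threading

# Hypothetical module-level cache, standing in for the kind of global
# that a parsing library might fill lazily on first use.
_cache = {}
_cache_lock = threading.Lock()


def get_cached(key, build):
    """Return the cached value for key, calling build(key) at most once.

    Without the lock, two threads could both miss the cache and both
    run build(), or one could observe a half-populated structure.
    """
    with _cache_lock:
        try:
            return _cache[key]
        except KeyError:
            # The lock is held while building, which keeps the sketch
            # simple at the cost of serializing cache misses.
            value = _cache[key] = build(key)
            return value
```

Holding the lock during the build is the simplest correct choice; a
per-key lock or a per-thread cache (threading.local) would reduce
contention if building is slow.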