html5lib not thread safe. Is the Python SAX library thread-safe?
Stefan Behnel
stefan_ml at behnel.de
Mon Mar 12 06:05:34 EDT 2012
John Nagle, 11.03.2012 21:30:
> "html5lib" is apparently not thread safe.
> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> Looking at the code, I've only found about three problems.
> They're all the usual "cached in a global without locking" bug.
> A few locks would fix that.
>
> But html5lib calls the XML SAX parser. Is that thread-safe?
> Or is there more trouble down at the bottom?
>
> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> which is thread safe, although dated. I'm looking at converting to
> html5lib.)
You may also consider moving to lxml. BeautifulSoup supports it as a parser
backend these days, so you wouldn't even have to rewrite your code to use
it. And performance-wise, well, see this comparison:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Stefan