Code that ought to run fast, but can't due to Python limitations.
stefan_ml at behnel.de
Tue Jul 7 11:09:59 CEST 2009
John Nagle wrote:
> I have a small web crawler robust enough to parse
> real-world HTML, which can be appallingly bad. I currently use
> an extra-robust version of BeautifulSoup, and even that sometimes
> blows up. So I'm very interested in a new Python parser which supposedly
> handles bad HTML in the same way browsers do. But if it's slower
> than BeautifulSoup, there's a problem.
Well, if performance matters in any way, you can always use lxml's
blazingly fast parser first, possibly trying a couple of different
configurations, and only if all fail, fall back to running html5lib over
the same input. That should give you a tremendous speed-up over your
current code in most cases, while keeping things robust in the hard cases.
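In code, the fallback might look roughly like this (untested sketch; parse_robust is just an example name, and since the exact exception lxml raises when it gives up can vary, I'm catching the common LxmlError base class here):

    from lxml import etree, html
    import html5lib

    def parse_robust(data):
        # Try lxml's fast, recovering HTML parser first.
        try:
            return html.fromstring(data)
        except etree.LxmlError:
            # lxml gave up on the input entirely, so hand it to html5lib,
            # which parses broken HTML the way browsers do (more slowly).
            # The "lxml" tree builder keeps the result in lxml's data model.
            return html5lib.parse(data, treebuilder="lxml")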
Note the numbers that Ian Bicking published on HTML parser performance.
You should be able to run lxml's parser ten times in different
configurations (e.g. different charset overrides) before it even reaches
the time that BeautifulSoup would need to parse a document once. Given that
undeclared character set detection is something where BS is a lot better
than lxml, you can also mix the best of both worlds and use BS's character
set detection to configure lxml's parser if you notice that the first
parsing attempts fail.
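Roughly like this (untested sketch, assuming BeautifulSoup 3's UnicodeDammit; the attribute names changed in later BS versions, and detect_and_parse is just an example name):

    from BeautifulSoup import UnicodeDammit   # BeautifulSoup 3
    from lxml import etree, html

    def detect_and_parse(data):
        # Let BS guess the character set of the raw bytes ...
        converted = UnicodeDammit(data)
        if converted.unicode is None:
            raise ValueError("character set detection failed")
        # ... and tell lxml's parser about the detected encoding,
        # then parse the original byte string with that override.
        parser = etree.HTMLParser(encoding=converted.originalEncoding)
        return html.fromstring(data, parser=parser)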
And yes, html5lib performs pretty badly in comparison (or did, at the
time). But the numbers seem to indicate that if you can drop the ratio of
documents that require a run of html5lib below 30% and use lxml's parser
for the rest, you will still be faster than with BeautifulSoup alone.