Code that ought to run fast, but can't due to Python limitations.

John Nagle nagle at animats.com
Tue Jul 7 12:40:15 EDT 2009


Stefan Behnel wrote:
> John Nagle wrote:
>>     I have a small web crawler robust enough to parse
>> real-world HTML, which can be appallingly bad.  I currently use
>> an extra-robust version of BeautifulSoup, and even that sometimes
>> blows up.  So I'm very interested in a new Python parser which supposedly
>> handles bad HTML in the same way browsers do.  But if it's slower
>> than BeautifulSoup, there's a problem.
> 
> Well, if performance matters in any way, you can always use lxml's
> blazingly fast parser first, possibly trying a couple of different
> configurations, and only if all fail, fall back to running html5lib over
> the same input. 

    Detecting "fail" is difficult.  A common problem is badly terminated
comments which eat most of the document if you follow the spec.  The
document seems to parse correctly, but most of it is missing.  The
HTML 5 spec actually covers things like

	<!This is a bogus SGML directive>

and treats it as a bogus comment.  (That's because HTML 5 doesn't
include general SGML; the only directive recognized is DOCTYPE.
Anything else after "<!" is treated as a token-level error.)

    So using an agreed-upon parsing method, in the form of html5lib,
is desirable, since it should mimic browser behavior.

					John Nagle


