Code that ought to run fast, but can't due to Python limitations.
nagle at animats.com
Tue Jul 7 08:31:26 CEST 2009
Steven D'Aprano wrote:
> On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
>> Python is not C.
> John Nagle is an old hand at Python. He's perfectly aware of this, and
> I'm sure he's not trying to program C in Python.
> I'm not entirely sure *what* he is doing, and hopefully he'll speak up
> and say, but whatever the problem is it's not going to be as simple as
I didn't write this code; I'm just using it. As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use. HTML 5 parsing is well-defined for the awful
cases that make older browsers incompatible, but quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.
I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.
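A minimal sketch of how such a speed comparison might be set up, using only the standard library. The names `BAD_HTML`, `TagCounter`, `parse_with_stdlib`, and `best_time` are hypothetical and not from html5lib or BeautifulSoup; the standard library's `html.parser` is only a stand-in so the harness runs anywhere. To compare the parsers discussed here, one would add their parse calls to the `candidates` dictionary.

```python
import timeit
from html.parser import HTMLParser

# Deliberately malformed markup (unclosed and mis-nested tags),
# standing in for "real-world" HTML. This sample is an assumption.
BAD_HTML = (
    "<html><body><p>unclosed <b>bold <i>text</p>"
    "<table><tr><td>cell</body>"
) * 200


class TagCounter(HTMLParser):
    """Stand-in parser: counts start tags using the standard library."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1


def parse_with_stdlib(markup):
    """Hypothetical candidate; swap in the parser under test here."""
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


def best_time(parse, markup, repeat=3, number=5):
    """Return the best per-call wall-clock time in seconds."""
    timings = timeit.repeat(lambda: parse(markup), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    # Add other parsers here as name -> callable taking the markup string.
    candidates = {"stdlib html.parser": parse_with_stdlib}
    for name, parse in candidates.items():
        ms = best_time(parse, BAD_HTML) * 1000
        print(f"{name}: {ms:.2f} ms per parse")
```

Taking the minimum over several repeats reduces noise from other processes; timing on a corpus of actual crawled pages would be more representative than a synthetic string.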