HTMLParser compatibility with CPython 2.7.3

Hi,

I would like to import changes from:

The problem is that the HTMLParser from 2.7.2 is not lenient and tends to raise exceptions when an HTML document is not well formed: http://bugs.python.org/issue13987

This often causes a failure in BeautifulSoup, which gets a great speedup when run on PyPy with the stdlib HTMLParser:

"RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."

However, lxml is not compatible with PyPy, and html5lib is slow.

Can I port HTMLParser.py from Python 2.7.3 to PyPy?

-- Robert Zaremba
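For anyone who wants to check how a given interpreter's stdlib parser behaves on their own documents, here is a minimal sketch. It only uses the stdlib HTMLParser; the names TagCollector and try_parse are made up for this illustration and are not part of any library. Whether a malformed document raises depends on the stdlib version the interpreter ships, so the sketch simply reports what happened rather than assuming an outcome.

```python
try:
    from HTMLParser import HTMLParser  # Python 2.x stdlib (CPython or PyPy)
except ImportError:
    from html.parser import HTMLParser  # Python 3.x stdlib


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def try_parse(markup):
    """Feed markup to the stdlib parser.

    Returns (tags, error). error is None when parsing finished without
    raising; otherwise it is the exception raised by the parser
    (HTMLParseError on older, stricter stdlib versions).
    """
    parser = TagCollector()
    try:
        parser.feed(markup)
        parser.close()
    except Exception as exc:
        return parser.tags, exc
    return parser.tags, None


if __name__ == "__main__":
    samples = [
        "<html><body><p>hi</p></body></html>",  # well formed
        "<html><body><p>hi<br</p></body>",      # deliberately malformed
    ]
    for markup in samples:
        tags, error = try_parse(markup)
        print("%r -> tags=%r error=%r" % (markup, tags, error))
```

Running the same script under the stock build and under a build with a newer stdlib gives a quick way to see whether the documents that matter to you still trip the parser.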

2012/6/18 Robert Zaremba <robert.zaremba@zoho.com>
Can I port HTMLParser.py from Python 2.7.3 to PyPy?
In general, no, unless you port all the rest of the stdlib to 2.7.3 as well.

There is already work in progress on this, in the stdlib-2.7.3 branch. It's almost finished (and definitely worth a try), and there are some nightly builds (32-bit Linux only for the moment): http://buildbot.pypy.org/nightly/stdlib-2.7.3/

Still missing are the implementation of randomized hashes (not enabled by default anyway) and fixes for a couple of obscure bugs in the import system, probably implementation details.

-- Amaury Forgeot d'Arc
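When trying one of the nightly builds, it can help to confirm which implementation and language version the interpreter actually reports. The following is a small sketch using only sys and platform from the stdlib; the function names describe_interpreter and is_at_least are invented for this example, and the (2, 7, 3) threshold is just the version discussed in this thread.

```python
import platform
import sys


def describe_interpreter():
    """Return (implementation name, (major, minor, micro))."""
    impl = platform.python_implementation()  # e.g. "CPython" or "PyPy"
    version = tuple(sys.version_info[:3])
    return impl, version


def is_at_least(version, minimum=(2, 7, 3)):
    """True when the version tuple is at or above the given minimum."""
    return tuple(version) >= tuple(minimum)


if __name__ == "__main__":
    impl, version = describe_interpreter()
    print("%s reports Python %d.%d.%d" % ((impl,) + version))
    print("at least 2.7.3: %s" % is_at_least(version))
```

This only tells you what the interpreter claims to implement; it does not by itself prove that every stdlib module matches that version.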