Hi,

I created a branch "htmlparser" (as opposed to the previous "htmlparse") and used it to rewrite the current parser to support both the XML and HTML parser API of libxml2 (file src/lxml/parser.pxi).

Problem: It doesn't work (yet), it crashes.

I narrowed the problem down to the deallocation code. Deallocation of HTML trees (or at least "something" in their representation) seems to work differently in libxml2 than it does for XML trees. The result is a double free of the document or its nodes: once when releasing an element (attemptDeallocation) and again when releasing the document. This is difficult to debug from Python, as both usually happen in one step, when the last reference to an element is released. And I still haven't found the actual reason for this. However, I found that removing the call to "attemptDeallocation" from _NodeBase.__dealloc__ for HTML trees solves it.

So, I'm not sure how to handle this. It may mean that we have to handle object deallocation differently depending on the initial parser, which would be very unfortunate. There may also be an additional tweak to be done at parse time, but I wouldn't know what else to try. (Kasimier?)

Anyway, whoever wants to try it, just go ahead. Maybe someone else finds a way to get this to work.

For testing, there are a few test cases in test_htmlparser.py. Note that they will crash, so I can't add them to the automated test suite. You have to run them manually:

    PYTHONPATH=src python src/lxml/tests/test_htmlparser.py

I left a few debug prints in the source, so don't be surprised by the extra output.

Any input on this is appreciated.

Stefan
Stefan Behnel wrote:
I created a branch "htmlparser" (as opposed to the previous "htmlparse") and used it to rewrite the current parser to support both the XML and HTML parser API of libxml2 (file src/lxml/parser.pxi).
Problem: It doesn't work (yet), it crashes.
Correction: It works *now*. :)

There was a special case test for the document node in the lxml element deallocation code - for the /XML/ document node. The HTML document node has to be treated equally.

http://codespeak.net/svn/lxml/branch/htmlparser/

I integrated the test case now. Any input on this is still appreciated.

If it turns out to work well, this will be merged into the trunk to be integrated in 1.0.

Stefan
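
For readers following along, the fix described above can be modelled roughly as follows. This is only a sketch in plain Python, not the actual lxml code: the function names `may_free_subtree` and `may_free_subtree_xml_only` are hypothetical, and only the numeric node type codes are taken from libxml2's xmlElementType enum. It assumes the deallocation code decides whether to free a subtree by looking at the type of its top node.

```python
# Node type codes as defined in libxml2's xmlElementType enum.
XML_ELEMENT_NODE = 1
XML_DOCUMENT_NODE = 9
XML_HTML_DOCUMENT_NODE = 13

# Both document node types are released together with the document itself.
DOCUMENT_NODE_TYPES = frozenset({XML_DOCUMENT_NODE, XML_HTML_DOCUMENT_NODE})


def may_free_subtree_xml_only(top_node_type):
    """Hypothetical model of the old check: only the XML document node
    is recognised, so an HTML document node looks like a freeable subtree."""
    return top_node_type != XML_DOCUMENT_NODE


def may_free_subtree(top_node_type):
    """Hypothetical model of the corrected check: XML and HTML document
    nodes are treated the same and are never freed as a plain subtree."""
    return top_node_type not in DOCUMENT_NODE_TYPES


if __name__ == "__main__":
    for name, code in [("element", XML_ELEMENT_NODE),
                       ("XML document", XML_DOCUMENT_NODE),
                       ("HTML document", XML_HTML_DOCUMENT_NODE)]:
        print(name, may_free_subtree_xml_only(code), may_free_subtree(code))
```

With the XML-only check, an HTML document node would be freed once as a "subtree" and again with the document, which matches the double free reported earlier in the thread.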
Stefan Behnel wrote:
Stefan Behnel wrote:
I created a branch "htmlparser" (as opposed to the previous "htmlparse") and used it to rewrite the current parser to support both the XML and HTML parser API of libxml2 (file src/lxml/parser.pxi).
Problem: It doesn't work (yet), it crashes.
Correction: It works *now*. :)
There was a special case test for the document node in the lxml element deallocation code - for the /XML/ document node. The HTML document node has to be treated equally.
http://codespeak.net/svn/lxml/branch/htmlparser/
I integrated the test case now. Any input on this is still appreciated.
If it turns out to work well, this will be merged into the trunk to be integrated in 1.0.
This is great news! I'll grab the branch and give it a try.

Excellent, thanks, Stefan!

--Paul
participants (2)
- Paul Everitt
- Stefan Behnel