Re: [lxml-dev] HTMLParser behavior - ill formed HTML
Hi Dean,

Dean Pavlekovic wrote:
On 5/2/06, Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote:
I changed the trunk to always accept ill-formed results if the recover option is set and no lxml-internal errors were raised. Please try if that helps. I couldn't come up with a sufficiently short example of broken HTML where this problem occurs, so I couldn't test it. The examples that were tested so far can be found in src/lxml/tests/test_htmlparser.py.
Unfortunately it's not working, and I'm sorry, but I didn't have time to look into it more deeply. It appears that the value of the options is changed somewhere in the libxml2 context during the htmlCtxtReadFile call. I printed the parser's self._parse_options before the call (97, that is, with the RECOVER flag set) and ctxt.options after the call, which was 96, meaning the flag was reset by libxml2.
Interesting. Could you tell me what version of libxml2 you are using (see below)? My guess is that it's older than 2.6.21. Libxml2 copies the options by hand, so if the RECOVER option is unknown, it will not turn up in the context options. That makes me wonder why recovery worked for Paul on 2.6.16 in the first place...

Anyway. I changed the trunk to pass the option explicitly to _handleParseResult so that we no longer rely on libxml2. Please test if this works for you now. I'd be glad if you could come up with a short piece of broken HTML code that triggers the "not well formed" case in recovery mode. That would allow us to set up a test case to check that it actually works (and keeps working).

To check which versions you are using, I added module attributes "LXML_VERSION", "LIBXML_VERSION" and "LIBXSLT_VERSION" that carry tuples containing the respective versions used by lxml, e.g. (2, 6, 23) for "2.6.23". "LIBXML_COMPILED_VERSION" and "LIBXSLT_COMPILED_VERSION" show against which versions lxml was compiled.

BTW, these attributes are mainly meant for debugging purposes. Since they will first (officially) appear in lxml 1.0, code that wants to use them anyway will have to check if they are available (hasattr or try-except) before accessing them. All of them appeared at the same time, so if one is there, the others will be there, too.
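A minimal sketch of the availability check described above, assuming only the attribute names given in this message. The helper name report_versions is hypothetical and not part of lxml; it takes the module as an argument so it can be tried without lxml installed.

```python
# Attribute names as listed in this message; report_versions itself is a
# hypothetical helper for illustration, not an lxml API.
VERSION_ATTRS = (
    "LXML_VERSION",
    "LIBXML_VERSION",
    "LIBXML_COMPILED_VERSION",
    "LIBXSLT_VERSION",
    "LIBXSLT_COMPILED_VERSION",
)


def report_versions(module):
    """Return {attribute name: dotted version string} for a module.

    Returns an empty dict when the attributes are missing (for example
    on a release that predates them). Checking a single attribute is
    enough because all of them were introduced together.
    """
    if not hasattr(module, "LIBXML_VERSION"):
        return {}
    return {
        name: ".".join(str(part) for part in getattr(module, name))
        for name in VERSION_ATTRS
    }


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        for name, version in sorted(report_versions(etree).items()):
            print(name, version)
```

Testing for one attribute with hasattr keeps the check cheap; a try-except around the attribute access would work just as well.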
Btw. is there a reason that the XML_... enums are assigned an explicit value in xmlparser.pxd and the same is not done for HTML_... enums in htmlparser.pxd? I'm not familiar with Pyrex and whether it parses C/C++ .h files to match values...
Pyrex doesn't care about the values of enums and C uses the .h files directly, so, since they are just copy&pasted into the .pxd files, they sometimes have numbers and sometimes not.

Stefan
Hello Stefan,

Now, after running with your new patches, it works well! And regarding your previous question:
Interesting. Could you tell me what version of libxml2 you are using (see below)? My guess is that it's older than 2.6.21. Libxml2 copies the options by hand, so if the RECOVER option is unknown, it will not turn up in the context options. That makes me wonder why recovery worked for Paul on 2.6.16 in the first place...
I am using libxml2 version 2.6.21 (on Ubuntu: the libxml2 2.6.21-0ubuntu1 package):

    lrwxrwxrwx 1 root root 17 2006-05-01 21:31 /usr/lib/libxml2.so -> libxml2.so.2.6.21

I've confirmed this after applying your patches: LIBXML_VERSION and LIBXML_COMPILED_VERSION both equal (2, 6, 21).

Just to confirm the behavior from my last email, I changed the HTMLParser by adding some print statements before and after the htmlCtxtReadFile call (before your latest patches):

---
cdef xmlDoc* _parseDocFromFile(self, char* c_filename) except NULL:
    ...
    self._initContext(pctxt)
    print 'In _parseDocFromFile - before htmlCtxtReadFile: self._parse_options=%s' % self._parse_options
    result = htmlparser.htmlCtxtReadFile(
        pctxt, c_filename, NULL, self._parse_options)
    print 'In _parseDocFromFile - after htmlCtxtReadFile: ctxt.options=%s' % pctxt.options
    self._error_log.disconnect()
    return _handleParseResult(pctxt, result, c_filename)
---

and if I run this script, lxmltest1.py (files attached):

    from lxml import etree
    doc = etree.parse('httest.html', parser=etree.HTMLParser(recover=True))
    etree.dump(doc)

the result is:

    dean@boycie:~/work/main/oldstuff/dean/re$ python lxmltest1.py
    In _parseDocFromFile - before htmlCtxtReadFile: self._parse_options=97   <<<<< here
    In _parseDocFromFile - after htmlCtxtReadFile: ctxt.options=96   <<<< here
    Traceback (most recent call last):
      File "lxmltest1.py", line 2, in ?
        doc = etree.parse('httest.html', parser=etree.HTMLParser(recover=True))
      File "etree.pyx", line 1401, in etree.parse
      File "parser.pxi", line 489, in etree._parseDocument
      File "parser.pxi", line 464, in etree._parseDocFromFile
      File "parser.pxi", line 437, in etree.HTMLParser._parseDocFromFile
      File "parser.pxi", line 177, in etree._handleParseResult
    etree.XMLSyntaxError: htmlParseEntityRef: expecting ';'

(the error is because of unescaped &-s in the href attribute value)

Next I made a C test using 'plain' libxml2 (please see lxmltest2.c), and this behavior was confirmed. So I guess this is a libxml2 issue. Or it may be something specific to my local setup, if you don't manage to reproduce it...

    gcc -o lxmltest2 -I/usr/include/libxml2 -lxml2 lxmltest2.c

Although it's an odd feature/bug, I hope it's useful to know about it :-/

Best regards,
Dean

PS. The httest.html should be in cp1250 encoding (Windows East European).
Hi Dean,

Dean Pavlekovic wrote:
Now, after running with your new patches, it works well!
Fine, then we have found a work-around that works on older libxml2 versions.
Libxml2 copies the options by hand, so if the RECOVER option is unknown, it will not turn up in the context options.
I am using libxml2 version 2.6.21 (on Ubuntu: libxml2 2.6.21-0ubuntu1 package): /usr/lib/libxml2.so -> libxml2.so.2.6.21 I've confirmed this after applying your patches LIBXML_VERSION and LIBXML_COMPILED_VERSION both equate to (2, 6, 21)
Thanks. I took a second look at it. The version does not matter, the respective code in libxml2 2.6.21 to 2.6.23 (current) looks like this:

    if (options & HTML_PARSE_RECOVER) {
        ctxt->recovery = 1;
    } else
        ctxt->recovery = 0;
    if (options & HTML_PARSE_COMPACT) {
        ctxt->options |= HTML_PARSE_COMPACT;
        options -= HTML_PARSE_COMPACT;
    }

So, as opposed to most other options, the RECOVER option is not copied at all and not even removed from the original options to show that it was accepted (as is written in the docs). I'll file a bug report on it. The work-around will just stay in lxml as is.

Stefan
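To illustrate the effect, here is a small Python model of the option handling quoted above, applied to the value 97 reported earlier in this thread. It is a sketch, not libxml2 code: the function name use_options and its return values are hypothetical, the flag values are assumed to match the htmlParserOption enum in libxml2's HTMLparser.h, and the copy-and-remove treatment of NOERROR and NOWARNING is assumed from the "most other options" remark.

```python
# Flag values assumed to match libxml2's htmlParserOption enum.
HTML_PARSE_RECOVER = 1 << 0
HTML_PARSE_NOERROR = 1 << 5
HTML_PARSE_NOWARNING = 1 << 6
HTML_PARSE_COMPACT = 1 << 16

# Options assumed to be copied into the context and then removed from
# the caller's value, following the COMPACT branch in the quoted code.
COPIED_OPTIONS = (HTML_PARSE_NOERROR, HTML_PARSE_NOWARNING, HTML_PARSE_COMPACT)


def use_options(options):
    """Hypothetical model; returns (ctxt_options, recovery, leftover)."""
    ctxt_options = 0
    # RECOVER only toggles the recovery field: it is neither copied
    # into ctxt_options nor subtracted from the remaining options.
    recovery = 1 if options & HTML_PARSE_RECOVER else 0
    for flag in COPIED_OPTIONS:
        if options & flag:
            ctxt_options |= flag
            options -= flag
    return ctxt_options, recovery, options


ctxt_options, recovery, leftover = use_options(97)
print(ctxt_options, recovery, leftover)  # prints: 96 1 1
```

In this model the context options come out as 96, matching the value printed in the debugging output, while recovery is still switched on internally and the RECOVER bit stays in the leftover options. A check for that bit in the context options therefore cannot succeed, which is consistent with passing the option to _handleParseResult explicitly.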