Re: [lxml-dev] About the position of html parsing by HTML Target parser

[bringing this back to the list] Nicholas Dudfield wrote:
Ah, sure, that works.
Right, I keep forgetting that CDATA is evil. But you can check for CDATA as well.
I'd try that, yes. It should be a lot faster, as you only handle the elements in your code, not the text content, for example, and you always know exactly what the next opening or closing tag must be.

You can special-case self-closing tags by checking for contained text and children: if there isn't anything in them, be prepared to find the next opening element before finding a closing tag. Having the parsed tree available gives you a lot of information that you can exploit.

You may also consider using something like ahocorasick to search for all opening tags at once (note that lxml makes the namespace prefix available for each Element), plus "<"
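Something along these lines might work as a starting point. A rough sketch only: the search logic here is simplistic (e.g. '<a' would also match '<abc') and the helper is mine; only .iter(), .tag and .prefix are real lxml API.

from lxml import etree

def tag_positions(source, root):
    # Walk the tree in document order; for each element, scan forward
    # in the original source text for its opening tag.
    pos = 0
    for el in root.iter():
        if not isinstance(el.tag, basestring):
            continue                      # skip comments and PIs
        name = el.tag.split('}')[-1]      # drop any '{uri}' namespace part
        if el.prefix:                     # lxml exposes the prefix in use
            name = '%s:%s' % (el.prefix, name)
        pos = source.index(u'<' + name, pos)
        yield el, pos
        pos += 1                          # keep scanning after this match

source = u'<root><a/><b>text</b></root>'
for el, pos in tag_positions(source, etree.fromstring(source)):
    print '%s opens at character %d' % (el.tag, pos)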
[bringing this back to the list]
Sorry about that :)
You may also consider using something like ahocorasick
ahocorasick doesn't seem to work with native unicode, which makes it about as useful (for my particular purpose, anyway) as the parser context UTF-8 stream positions :( Is there any fundamental reason why an XML parser couldn't work with native unicode, i.e. an abstract character stream? I'm completely clueless when it comes to parsers.

Nicholas Dudfield wrote:
ahocorasick doesn't seem to work with native unicode,
Ah, great. :-/ I never tried it myself, just read about it more than once on c.l.py. Looking at the code, it actually reuses an existing C implementation, which explains the limitation.
So the editor you are working with (which one is it, BTW?) gives you unicode strings? I would have expected it to work with byte buffers internally. Or maybe that would be considered an implementation detail that doesn't show at the API level.
Not a fundamental reason, but it's a lot simpler and faster to parse XML streams in UTF-8 than in any other encoding, and it's also more efficient to parse them as a UTF-8 byte stream than as a Unicode character stream, especially in C. You basically read one byte and immediately know if it represents a control character or not. Unicode characters require 4 bytes here to represent all possible code points.

Also, UTF-8 is the internal representation format used inside of libxml2 anyway, for the same reason. So the parser of libxml2 first encodes the stream to UTF-8 (at the I/O stream buffer layer) and then processes it without further modifications.

Stefan
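PS: A toy Python illustration of the byte-level point (not libxml2 code): every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII markup byte like "<" can never occur inside an encoded non-ASCII character, which is what makes a bytewise scan safe.

text = u'<p>\u00e9\u65e5</p>'        # '<p>', e-acute, a CJK character, '</p>'
data = text.encode('utf-8')

for ch in data:
    o = ord(ch)
    if o < 0x80:
        kind = 'ASCII byte, safe to match directly'
    elif o < 0xC0:
        kind = 'continuation byte (10xxxxxx)'
    else:
        kind = 'lead byte of a multi-byte sequence (11xxxxxx)'
    print '0x%02X  %s' % (o, kind)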

So the editor you are working with (which one is it, BTW?)
SublimeText: http://www.sublimetext.com/features It's a Windows-only, closed-source editor which nevertheless has quite a few redeeming features, including a Python (2.5) API. All buffer access returns native unicode, and the substr(pt1, pt2) indexing is character-based rather than byte-based.
Or maybe that would be considered an implementation detail that doesn't show at the API level.
No idea what encoding it uses internally to represent characters.
So the parser of libxml2 first encodes the stream to UTF-8 (at the I/O stream buffer layer) and then processes it
When you say it converts first to UTF-8 internally, does that include recoding XML entities as well? I.e., a file that is already UTF-8 encoded may not necessarily keep the same byte stream after the first stage of processing? E.g. '<p>&quot;</p>' becoming '<p>"</p>'.

P.S. Thanks very much for your time (and lxml!)

Nicholas Dudfield wrote:
Sounds like a sensible API design. There is a special thing about parsing from Python unicode strings, BTW. Basically, lxml figures out the platform-specific encoding that CPython uses internally (at startup time), and then passes the plain unicode string buffer to libxml2 together with the correct decoding selector. Thus, libxml2 will first recode the UCS2/UCS4-encoded byte sequence into UTF-8, and then parse that.
No, the codec layer only recodes the characters, so you'd get a UTF-8 encoded "&quot;" byte string as a result. The rest is handled by the XML parser layer, which sees the "&" and considers it the start of an entity/char reference.

Stefan
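PS: This is easy to see from lxml itself (real API, nothing hypothetical here):

from lxml import etree

source = u'<p>&quot;</p>'
print source.encode('utf-8')           # the codec keeps '&quot;' verbatim
print etree.fromstring(source).text    # the parser resolves it and prints: "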

No, the codec layer only recodes the characters, so you'd get a UTF-8 encoded "&quot;" byte string as a result.
Makes sense. I was hoping that was the case :) Asking more about possibilities than your personal inclinations and time allowances: would it be very difficult to add `character` index attributes (startchpos, endchpos, etc.) to nodes? I was thinking of something akin to the sourceline attribute. Would this be possible purely from lxml/Cython land, or would libxml2 need to be patched? Being a novice programmer I have little C or Cython experience, but it would be an interesting and motivating project to learn on. As `unicode` is the future of Python `text`, a character-based index would be useful (admittedly for not that many use cases) regardless of encoding? For my use case, character-based positions would be perfect. Please forgive any misconceptions driving boneheaded questions.

Nicholas Dudfield wrote:
The latter. Here's what libxml2 knows about a node: http://xmlsoft.org/html/libxml-tree.html#xmlNode So it doesn't remember any character positions, and it only knows source line numbers up to 65535 (because of memory considerations).
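For completeness, the line information it does keep is already exposed in lxml as the sourceline attribute you mentioned, e.g.:

from lxml import etree

root = etree.fromstring('<root>\n  <child/>\n</root>')
print root.sourceline        # 1
print root[0].sourceline     # 2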
Certainly. However, it does require some work to recover this information from inside the parser framework (due to recoding), so I doubt it's worth adding such a feature for 'general' use. If you want to dig into this, you'll have to start reading through the source code of libxml2 to figure out where this information could become available. "xmlio.c" might be a good place to start, as it implements the I/O routines that copy between encoded buffers.
Please forgive any misconceptions driving boneheaded questions.
It's always fine to ask.

Stefan

List,

Please excuse me if this question has been answered before, but I couldn't find anything in the list archives that spelled it out for dummies.

My usage situation is this:

* I'm using Windows
* I'm parsing XHTML with the XHTML parser
* I'm calling lxml from within a Python-extensible editor

My problem:

* Parsing failures due to `unknown` entities, even quite common ones, e.g. XMLSyntaxError: Entity 'nbsp' not defined, line 11, column 11

How can I set up an external file with common entity definitions that I can pass as an argument to the parser constructor? I read something about a `catalog`, but the only docs I could find on it assumed *nix.

If someone could help out with a code snippet example I would be very much appreciative. Cheers.

Nicholas Dudfield wrote:
Passing `resolve_entities=False` to the parser constructor ought to work for my case. There seems to be a bug related to this in the feed interface: if you feed the whole document in one go it honours the constructor option, but if you pass it in chunks (as you typically would) it fails. I have attached some test cases. For better or worse, they are all written to `pass`, proving the `errors` via assertRaises.

#!/usr/bin/env python
#coding: utf8

#################################### IMPORTS ###################################

# Std Libs
import unittest

# 3rd Party Libs
from lxml import html
from lxml.etree import XMLSyntaxError

################################### CONSTANTS ##################################

XHTML_SAMPLE = """\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
    <title>TEST</title>
</head>

<body>
    <p>&nbsp;</p>
</body>
</html>
"""

##################################### TESTS ####################################

class TestEntitiesHandling(unittest.TestCase):
    def test_html_parser_works_fine_with_nbsp(self):
        html.fromstring(XHTML_SAMPLE, parser=html.HTMLParser())
        root = html.fromstring(XHTML_SAMPLE)

    def test_html_parser_is_used_even_when_xhtml_doctype_used(self):
        root = html.fromstring(XHTML_SAMPLE)
        assert type(root.getroottree().parser) is not html.XHTMLParser
        assert root.xpath('//body')  # xmlns is ignored else xpath would fail

    def test_blows_up_when_using_xhtml_parser(self):
        def blowup():
            html.fromstring(XHTML_SAMPLE, parser=html.XHTMLParser())

        # XMLSyntaxError: Entity 'nbsp' not defined, line 11, column 10
        self.assertRaises(XMLSyntaxError, blowup)

    def test_resolve_entities(self):
        "This works as expected"
        html.fromstring(
            XHTML_SAMPLE,
            parser=html.XHTMLParser(resolve_entities=False))

    def test_resolve_entities_with_feed_interface__feed_whole_document(self):
        "This works as expected"
        parser = html.XHTMLParser(resolve_entities=False)

        # Feeding the whole XHTML document at once
        parser.feed(XHTML_SAMPLE)
        parser.close()

    def test_resolve_entities_with_feed_interface__chunked(self):
        "Ridiculous case, but might be showing bug in rare `normal` use cases"
        parser = html.XHTMLParser(resolve_entities=False)

        def blowup():
            for chs in (XHTML_SAMPLE[p:p+64]
                        for p in xrange(0, len(XHTML_SAMPLE), 64)):
                parser.feed(chs)
            parser.close()

        # XMLSyntaxError: Entity 'nbsp' not defined, line 11, column 10
        self.assertRaises(XMLSyntaxError, blowup)

##################################### MAIN #####################################

if __name__ == '__main__':
    unittest.main()

################################################################################

Hi, Nicholas Dudfield wrote:
You can let the parser load the DTD by setting load_dtd=True. lxml does not load DTDs by default, and without the DTD the parser fails on unknown entity references. Also, lxml will not access the network by default, so unless you use a catalog, you must also pass no_network=False. Note that this may slow down parsing considerably, as each document requires loading the DTD from the network first.
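For example, something like this should work (load_dtd and no_network are real lxml parser options; the document is a minimal stand-in):

from lxml import etree

XHTML = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>TEST</title></head>
  <body><p>&nbsp;</p></body>
</html>'''

# Slow: the XHTML DTD is fetched from w3.org before parsing starts.
parser = etree.XMLParser(load_dtd=True, no_network=False)
root = etree.fromstring(XHTML, parser)
print repr(root.findtext('.//{http://www.w3.org/1999/xhtml}p'))  # the resolved nbsp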
I read something about a `catalog` but the only docs I could find on it assumed *nix.
You need to set the XML_CATALOG_FILES environment variable to a space-separated list of catalog files: http://xmlsoft.org/catalog.html I have no idea how to install or manage XML catalogs under Windows, though.
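If you set it from Python, something like this might do (the path is only a placeholder; the variable must be set before libxml2 is initialised, i.e. before lxml is imported):

import os
os.environ['XML_CATALOG_FILES'] = 'file:///C:/xml/catalog.xml'  # placeholder path

from lxml import etree   # import only after setting the variable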
This sounds like a bug to me. Could you file a bug report? https://bugs.launchpad.net/lxml

Thanks!
Stefan

Of course :) I just signed up to LP and filed the report with test cases (modified from the ones I sent earlier to the list: the buggy behaviour now `fails`).

There was also a possible bug I noticed in relation to using XPath searches for text() when a parser was initialised with `strip_cdata=False`. I'll have a look into that now and see if I can write a test case that consistently exposes it.

I also noticed a fault in the CSS parser (used for lxml.cssselect.css_to_xpath) which can put the interpreter in an infinite loop, but IIRC that bug was already mentioned on the list.
