[lxml-dev] Named entities ignored
Are named entities ignored by lxml? Even with a suitable DTD declared, the following program does not output the copyright sign; it does, however, if it's provided as a numeric entity.

****
from lxml import etree
from StringIO import StringIO
import sys

source = """\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<p>&copy;2005</p>
</html>
"""

infile = StringIO(source)
tree = etree.parse(infile)
tree.write(sys.stdout)
****

Hamish Lawson
Hamish Lawson wrote:
> Are named entities ignored by lxml? Even with a suitable DTD declared, the following program does not output the copyright sign; it does, however, if it's provided as a numeric entity.
>
> ****
> from lxml import etree
> from StringIO import StringIO
> import sys
>
> source = """\
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html>
> <p>&copy;2005</p>
> </html>
> """
>
> infile = StringIO(source)
> tree = etree.parse(infile)
> tree.write(sys.stdout)
> ****
Unfortunately I haven't looked into this area at all yet; I've been ignoring DTD support so far. I'm sure the underlying libxml2 library can do the right thing; the question is which configuration knobs it needs to actually do this...

In this case it would need to go online (or at least to some local XML catalog) to find the DTD. I think I recall turning off network access during parsing, so this may be blocking this particular behavior.

It would be very helpful if you could turn some of this into test cases to be eventually integrated into lxml's test suite. There's an area in svn where you can check these in, provided you have access to codespeak:

http://codespeak.net/svn/lxml/testcase/

If you do not have svn commit access I may be able to wrangle some for you, or I can check it in for you. Check out the README.txt in the testcase area for more details on how things are organized.

Regards,

Martijn
> I think I recall turning off the network access during parsing, so this may be blocking this particular behavior.
I thought I would see what would happen if I downloaded the DTD (and its accompanying files) to the local disk and modified the declaration accordingly. First I tried the obvious, a local external DTD:

====
<!DOCTYPE html SYSTEM "xhtml1-transitional.dtd">
<p>&copy;2005</p>
====

The copyright entity was still ignored. Then I tried an internal DTD that includes an external subset:

====
<!DOCTYPE html [
<!ENTITY % xhtml1-transitional SYSTEM "xhtml1-transitional.dtd">
%xhtml1-transitional;
]>
<p>&copy;2005</p>
====

This worked! The copyright entity is rendered by tree.write as "©".

Out of curiosity I thought I would modify this last version so that the external subset was fetched from the Internet rather than locally:

====
<!DOCTYPE html [
<!ENTITY % xhtml1-transitional SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
%xhtml1-transitional;
]>
<p>&copy;2005</p>
====

Surprisingly this also worked (with the longer run time testifying that the DTD was indeed being fetched from the Internet).

In conclusion it would seem that network access hasn't actually been disabled. Rather it seems that it's external DTDs that are not supported, whether local or on the Internet. Internal DTDs are supported, and these can include external (local or Internet) subsets.

Hamish
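The underlying point, that a named entity only resolves once the parser has actually seen a declaration for it, can be shown without any network or file access. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree (expat) rather than lxml, and it declares the `copy` entity directly in the internal subset instead of pulling in the XHTML DTD, so it says nothing about how lxml or libxml2 treat external DTDs. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption for illustration: the entity is declared inline in the
# internal subset, instead of being loaded from xhtml1-transitional.dtd.
declared = """<!DOCTYPE html [
  <!ENTITY copy "&#169;">
]>
<html><p>&copy;2005</p></html>"""

# Same document, but with no declaration for &copy; anywhere.
undeclared = "<html><p>&copy;2005</p></html>"

# With the declaration visible, the parser substitutes the entity.
root = ET.fromstring(declared)
resolved_text = root.find("p").text

# Without it, this parser rejects the reference outright.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(repr(resolved_text))
print(undeclared_ok)
```

Note that the stdlib parser raises an error for the undeclared entity, whereas the behavior reported in this thread for lxml was that the entity was silently dropped from the output.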
Hamish Lawson wrote: [snip]
> In conclusion it would seem that network access hasn't actually been disabled. Rather it seems that it's external DTDs that are not supported, whether local or on the Internet. Internal DTDs are supported, and these can include external (local or Internet) subsets.
Interesting, thanks for this analysis. The libxml2 documentation is somewhat vague about it all. lxml uses the following options for parsing:

    cdef int _getParseOptions():
        return (xmlparser.XML_PARSE_NOENT |
                xmlparser.XML_PARSE_NOCDATA |
                xmlparser.XML_PARSE_NOWARNING |
                xmlparser.XML_PARSE_NOERROR)

The following options exist according to this page:

http://www.xmlsoft.org/html/libxml-parser.html

    Enum xmlParserOption {
        XML_PARSE_RECOVER = 1 : recover on errors
        XML_PARSE_NOENT = 2 : substitute entities
        XML_PARSE_DTDLOAD = 4 : load the external subset
        XML_PARSE_DTDATTR = 8 : default DTD attributes
        XML_PARSE_DTDVALID = 16 : validate with the DTD
        XML_PARSE_NOERROR = 32 : suppress error reports
        XML_PARSE_NOWARNING = 64 : suppress warning reports
        XML_PARSE_PEDANTIC = 128 : pedantic error reporting
        XML_PARSE_NOBLANKS = 256 : remove blank nodes
        XML_PARSE_SAX1 = 512 : use the SAX1 interface internally
        XML_PARSE_XINCLUDE = 1024 : Implement XInclude substitition
        XML_PARSE_NONET = 2048 : Forbid network access
        XML_PARSE_NODICT = 4096 : Do not reuse the context dictionnary
        XML_PARSE_NSCLEAN = 8192 : remove redundant namespaces declarations
        XML_PARSE_NOCDATA = 16384 : merge CDATA as text nodes
        XML_PARSE_NOXINCNODE = 32768 : do not generate XINCLUDE START/END nodes
    }

Concerning DTDs, XML_PARSE_DTDLOAD seems interesting. Apparently I did not set XML_PARSE_NONET after all, as I thought I had.

If you want to play around with the various options, edit etree.pyx and find _getParseOptions() at about line 1753. You can recompile using 'make' (or, if you want to make sure you're in a clean state: make clean; make).

Perhaps you can help figure out what these options really do, so we know which ones we should enable as a default. If we decide we really need the parser to do different stuff under different circumstances (perhaps no network access for security reasons), then we could consider introducing an option somewhere in lxml.

One goal of lxml, though, is to prevent the confusion of the many options that libxml2 offers us. :)

Thanks!

Martijn
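Since the options are plain bit flags, it can help to see how the values listed above combine. The following is a small standalone sketch, not lxml code: the `ParseOption` class and the variable names are hypothetical, and only the numeric values are taken from the enum quoted in this message. It models the option set returned by _getParseOptions() and a candidate set with XML_PARSE_DTDLOAD added.

```python
from enum import IntFlag


class ParseOption(IntFlag):
    """Hypothetical mirror of the xmlParserOption values quoted above."""
    RECOVER = 1
    NOENT = 2
    DTDLOAD = 4
    DTDATTR = 8
    DTDVALID = 16
    NOERROR = 32
    NOWARNING = 64
    PEDANTIC = 128
    NOBLANKS = 256
    SAX1 = 512
    XINCLUDE = 1024
    NONET = 2048
    NODICT = 4096
    NSCLEAN = 8192
    NOCDATA = 16384
    NOXINCNODE = 32768


# The combination that _getParseOptions() is quoted as returning.
current = (ParseOption.NOENT | ParseOption.NOCDATA |
           ParseOption.NOWARNING | ParseOption.NOERROR)

# A candidate default that also asks for the external subset to be loaded.
candidate = current | ParseOption.DTDLOAD

print(int(current))                      # combined integer value
print(ParseOption.NONET in current)      # is network access forbidden?
print(ParseOption.DTDLOAD in candidate)  # is DTD loading requested?
```

Checking membership this way confirms at a glance that the quoted option set does not include NONET, which matches the observation that the DTD could be fetched from the Internet.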