XMLParser mode resolve_entities=False and entities in attributes

Hi,

this has been discussed before in 11/2009, but the bug seems to persist, so I will try to document it again:

If an XML parser is created with XMLParser(resolve_entities=False), and the document declares an external DTD, then entities in attributes are inserted into the parent element (if a parent element exists), directly before the element containing that attribute.

Expected behaviour:
- an error, because the entities are undeclared; or, more useful in some cases:
- the entities stay in their attributes

Workarounds:
- Declare an internal DTD that defines all entities
- Use an actual external DTD *and* use dtd_validation=True with XMLParser

Sample code (see also: http://pastebin.com/24bM98La -- some more examples there):

    from lxml import etree
    parser = etree.XMLParser(resolve_entities=False)
    try:
        tree = etree.XML("""<test>1<a href="&uuml;bel">&ouml;</a></test>""", parser=parser)
    except etree.XMLSyntaxError as e:
        print(e)

Output:

    Entity 'uuml' not defined, line 1, column 23
The same document with an external DTD declared:

    from lxml import etree
    parser = etree.XMLParser(resolve_entities=False)
    try:
        tree = etree.XML("""<!DOCTYPE test SYSTEM ""><test>1<a href="&uuml;bel">&ouml;</a></test>""", parser=parser)
        print(tree[:])
        print(tree.find('.//a').attrib['href'])
        print(etree.tostring(tree))
    except etree.XMLSyntaxError as e:
        print(e)

Output:
Tested with various lxml versions, e.g.:

    Python              : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
    lxml.etree          : (3, 4, 0, 0)
    libxml used         : (2, 9, 2)
    libxml compiled     : (2, 9, 2)
    libxslt used        : (1, 1, 28)
    libxslt compiled    : (1, 1, 28)

jens
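
The first workaround mentioned in the report (an internal DTD that declares every entity used) could look roughly like the sketch below. The helper `internal_dtd` is hypothetical and not part of lxml; it assumes Python 3 (`html.entities`; the report itself used Python 2.7) and that every referenced name is a standard HTML entity. The lxml call is guarded so the snippet still runs where lxml is not installed.

```python
import re
from html.entities import name2codepoint  # stdlib table: entity name -> code point

def internal_dtd(root_name, xml_text):
    """Build a DOCTYPE whose internal subset declares every named entity
    referenced in xml_text. Hypothetical helper, not an lxml API.
    Raises KeyError for names that are not standard HTML entities."""
    predefined = {"lt", "gt", "amp", "apos", "quot"}  # always known to XML
    found = set(re.findall(r"&([A-Za-z][A-Za-z0-9]*);", xml_text))
    names = sorted(found - predefined)
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, name2codepoint[name]) for name in names
    )
    return "<!DOCTYPE %s [%s]>" % (root_name, decls)

body = '<test>1<a href="&uuml;bel">&ouml;</a></test>'
doc = internal_dtd("test", body) + body

try:
    from lxml import etree
except ImportError:
    etree = None  # lxml not available; only the DOCTYPE construction is shown

if etree is not None:
    parser = etree.XMLParser(resolve_entities=False)
    tree = etree.XML(doc, parser=parser)
    print(etree.tostring(tree))
```

Scanning the text with a regular expression is a simplification: it also picks up references inside comments or CDATA sections, which only leads to a few unused declarations.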

jens quade wrote on 16.10.2014 at 13:40:
I get the same with plain libxml2:

"""
$ xmllint - <<EOF
<!DOCTYPE test SYSTEM ""><test>1<a href="&uuml;bel">&ouml;</a></test>
EOF
-:1: parser error : Entity 'uuml' not defined
<!DOCTYPE test SYSTEM ""><test>1<a href="&uuml;bel">&ouml;</a></test>
                                               ^
-:1: parser error : Entity 'ouml' not defined
<!DOCTYPE test SYSTEM ""><test>1<a href="&uuml;bel">&ouml;</a></test>
                                                          ^
<?xml version="1.0"?>
<!DOCTYPE test SYSTEM "">
<test>1&uuml;<a href="bel">&ouml;</a></test>
"""

Meaning: not a problem in lxml. Please report it on the libxml2 mailing list.

Stefan
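
To see whether a particular lxml/libxml2 installation shows the same symptom, a check along these lines might help. It is only a sketch: `attribute_lost_entity` and `check_installed_lxml` are hypothetical names, the test is a plain substring comparison on the serialized output, and the result depends entirely on the library versions installed.

```python
SAMPLE = '<!DOCTYPE test SYSTEM ""><test>1<a href="&uuml;bel">&ouml;</a></test>'

def attribute_lost_entity(serialized, attr='href="&uuml;bel"'):
    """True if the serialized document no longer carries the entity
    reference inside the attribute value (the symptom from this thread)."""
    return attr not in serialized

def check_installed_lxml():
    """Return True/False for the installed lxml, or None if it cannot be
    checked (lxml missing, or the parser rejects the document outright)."""
    try:
        from lxml import etree
    except ImportError:
        return None
    parser = etree.XMLParser(resolve_entities=False)
    try:
        tree = etree.XML(SAMPLE, parser=parser)
    except etree.XMLSyntaxError:
        return None
    return attribute_lost_entity(etree.tostring(tree).decode("ascii"))

# The serialized line from the xmllint run above shows the symptom:
print(attribute_lost_entity('<test>1&uuml;<a href="bel">&ouml;</a></test>'))
print(check_installed_lxml())
```') is True
assert attribute_lost_entity(SAMPLE) is False
assert check_installed_lxml() in (True, False, None)
</test>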

participants (2)
- jens quade
- Stefan Behnel