[lxml-dev] XMLSchema validate and entities
Hello,

First, congratulations: I have been using lxml for more than a year and appreciate the huge progress (and work) you have made.

I'm using lxml to validate XML document instances with etree.XMLSchema(schema_doc).validate(xml_doc). I used to work with DTDs, where it is possible to include standard sets of HTML entity declarations (for example &nbsp;, &eacute;, etc.).

Now, working with XML Schemas, some of those common HTML entities sometimes appear in the content (from an editor like FCK). At validation time, of course, I get an error like this:

  File "lxml.etree.pyx", line 2520, in lxml.etree.parse
  File "parser.pxi", line 1309, in lxml.etree._parseDocument
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 536, in lxml.etree._handleParseResult
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 21, column 16

1. Is there a way to escape those entities at validation time?
2. Or do I need to declare the entities in the schema? (I understand that this question is off-topic for lxml, but I didn't find a way to do that.)

Thank you.
Eric

-----------------------------------
* Eric Garin - eric@detede.com *
* Entity XML Editorial *
* www.detede.com *
-----------------------------------
Hi,

Eric Garin wrote:
> First, congratulations: I have been using lxml for more than a year and appreciate the huge progress (and work) you have made.

:) Happy to hear that.

> I'm using lxml to validate XML document instances with etree.XMLSchema(schema_doc).validate(xml_doc). I used to work with DTDs, where it is possible to include standard sets of HTML entity declarations (for example &nbsp;, &eacute;, etc.).
>
> Now, working with XML Schemas, some of those common HTML entities sometimes appear in the content (from an editor like FCK). At validation time, of course, I get an error like this:
>
>   File "lxml.etree.pyx", line 2520, in lxml.etree.parse
>   File "parser.pxi", line 1309, in lxml.etree._parseDocument
>   File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL
>   File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile
>   File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc
>   File "parser.pxi", line 536, in lxml.etree._handleParseResult
>   File "parser.pxi", line 478, in lxml.etree._raiseParseError
> lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 21, column 16
>
> 1. Is there a way to escape those entities at validation time?
The stack trace above shows up at parse time. If you have entity references in your XML document, you have to use a DTD at parse time that defines them, or you can pass the "resolve_entities=False" option to the parser to keep them in the tree (which might make tree handling a little harder, though).
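A minimal sketch of the two parse-time options mentioned above, not a definitive recipe. The sample document, its internal DTD subset, and the variable names are made up for illustration. In this sketch the entity is declared in the DOCTYPE in both cases; the only difference is whether the parser expands the reference or keeps it as an entity node.

```python
from lxml import etree

# Hypothetical sample: the same content with and without an entity declaration.
undeclared = b"<doc><p>Hello&nbsp;world</p></doc>"
declared = (
    b'<!DOCTYPE doc [<!ENTITY nbsp "&#160;">]>'
    b"<doc><p>Hello&nbsp;world</p></doc>"
)

# Without any declaration, parsing fails before validation is ever reached.
try:
    etree.fromstring(undeclared)
    parse_failed = False
except etree.XMLSyntaxError:
    parse_failed = True

# Option 1: a DTD (here an internal subset) defines the entity,
# and the default parser expands it into the text.
resolved_root = etree.fromstring(declared)
resolved_text = resolved_root.find("p").text

# Option 2: resolve_entities=False keeps the reference in the tree
# as a separate entity node between the surrounding text.
keep_parser = etree.XMLParser(resolve_entities=False)
kept_root = etree.fromstring(declared, keep_parser)
kept_p = kept_root.find("p")
entity_node = kept_p[0]
```

With the second option, code that walks the tree has to deal with entity nodes (here `entity_node`, whose `name` is the entity name) in addition to plain text, which is the extra handling cost mentioned above.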
> 2. Or do I need to declare the entities in the schema? (I understand that this question is off-topic for lxml, but I didn't find a way to do that.)

XML Schema deliberately does not support entity declarations (or references, for that matter). They are purely a DTD feature.

Stefan
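To illustrate how the two mechanisms can sit side by side, here is a small sketch under stated assumptions: the schema, the document, and all names in it are hypothetical. The entity is declared in the document's DTD internal subset, the parser expands it, and only then is the resulting tree validated against an XML Schema that knows nothing about entities.

```python
from lxml import etree

# Hypothetical schema: a <doc> element containing a single string <p>.
schema_root = etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="doc">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="p" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""")
schema = etree.XMLSchema(schema_root)

# Hypothetical instance: the entity lives in the DOCTYPE, not in the schema.
xml_doc = etree.fromstring(
    b'<!DOCTYPE doc [<!ENTITY nbsp "&#160;">]>'
    b"<doc><p>Hello&nbsp;world</p></doc>"
)

# The parser has already expanded the entity; the schema only sees text.
is_valid = schema.validate(xml_doc)
```

The design point is the ordering: entity handling is finished at parse time, so the schema validator never encounters an entity reference.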
participants (2)
- Eric Garin
- Stefan Behnel