Re: [lxml-dev] lxml lets undeclared entities pass through silently (was: Xhtml and entities)
Hi Eric,

please also reply to the mailing list (reply all), not just to me. That way, you may get comments from other people as well, and the mails will be archived so that others can search and read them.

Eric Garin wrote:
Sorry Stefan but I've actually read this documentation (and even several times)
Sorry if I sounded somewhat harsh, I do believe you.
Not quite, it does say something, it just doesn't raise an exception.
Ok, no exception here, so what happened?
print parser.error_log
entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined
So, libxml2 did find the missing entity and reported the error to lxml. I looked into it and it seems that the parser continued parsing and returned a document containing the entity reference, saying that it was well-formed. Therefore, lxml did not raise an exception.

When you serialise the document after parsing, you will see that the entity reference is still in there, so this actually works. However, when you print the ".text" of the element containing the entity reference, it is not printed, so you can see that it is not passed on to the API level.

Given the normal lxml behaviour of resolving entities and not supporting them at the API level at all, I would call this a bug. However, it is not obvious how to deal with it. Entity references currently pass in and out rather nicely, they are just not visible. So raising an error here would likely break some existing code that does not explicitly load DTDs to resolve them but relies on lxml's current behaviour of passing them through.

On the other hand, there is no easy way to support them at the API level, as they can occur anywhere in text content. How should lxml distinguish between a user passing in the entity "&entity;" and someone who just passes in the text "In XML, entities are written as &entity;", expecting it to be properly escaped like any other text in lxml.etree? So it is probably better to raise an error here than to make users deal with entities.

Any opinions on this?

Stefan
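To make the escaping question above concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the assumption is that lxml.etree escapes text content the same way, since its API follows ElementTree. The helper name serialise_with_text is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def serialise_with_text(text):
    """Build <p>text</p> through the API and return the serialised XML."""
    p = ET.Element("p")
    # .text takes plain text; the API has no way to say "this is an entity reference"
    p.text = text
    return ET.tostring(p, encoding="unicode")


# Someone writing *about* entities expects the ampersand to be escaped ...
literal = serialise_with_text("In XML, entities are written as &entity;")
# ... so someone who wants a real entity reference gets it escaped as well.
reference_attempt = serialise_with_text("&entity;")

print(literal)            # <p>In XML, entities are written as &amp;entity;</p>
print(reference_attempt)  # <p>&amp;entity;</p>
```

Both strings go through the same escaping, which is why a plain-text API cannot tell the two intentions apart.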
Stefan Behnel wrote:
Here's a trivial patch that raises an exception in this case. Still not sure this is the right solution, though.

Stefan

Index: src/lxml/parser.pxi
===================================================================
--- src/lxml/parser.pxi (Revision 43690)
+++ src/lxml/parser.pxi (Arbeitskopie)
@@ -622,7 +622,8 @@
     ctxt.myDoc = NULL
     if result is not NULL:
-        if ctxt.wellFormed or recover:
+        if recover or (ctxt.wellFormed and \
+                       ctxt.lastError.level < xmlerror.XML_ERR_ERROR):
             __GLOBAL_PARSER_CONTEXT.initDocDict(result)
         else:
             # free broken document
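The condition changed by the patch can be sketched in plain Python as follows. This is only an illustration of the decision logic, not the actual lxml code: the function names accept_old and accept_patched and the ErrorLevel class are made up for this sketch, while the numeric levels mirror libxml2's xmlErrorLevel enum (none, warning, error, fatal).

```python
from enum import IntEnum


class ErrorLevel(IntEnum):
    # Hypothetical stand-in mirroring the values of libxml2's xmlErrorLevel
    NONE = 0
    WARNING = 1
    ERROR = 2
    FATAL = 3


def accept_old(well_formed, recover):
    """Old condition: keep the document if it is well-formed or we are recovering."""
    return bool(well_formed or recover)


def accept_patched(well_formed, recover, last_error_level):
    """Patched condition: a well-formed document is still rejected
    if the last reported error was at ERROR level or above."""
    return bool(recover or (well_formed and last_error_level < ErrorLevel.ERROR))


# The case from this thread: the parser says "well-formed", but the
# undeclared entity was logged at ERROR level.
print(accept_old(True, False))                        # True
print(accept_patched(True, False, ErrorLevel.ERROR))  # False
```

Note that recover still wins in the patched version, so a recovering parser keeps returning the document regardless of the logged error level.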
participants (2)
- Eric Garin
- Stefan Behnel