Hi Eric,
Please also reply to the mailing list (reply all), not just to me. That way,
you may also get comments from other people, and the mails will be archived so
that others can search and read them.
Eric Garin wrote:
> Sorry Stefan but I've actually read this documentation (and even several times)
Sorry if I sounded somewhat harsh, I do believe you.
> So I did a test with something really simple :
[parse XML containing an undeclared entity]
> Result : parser says nothing even if &oneXXX; is not declared
Not quite, it does say something, it just doesn't raise an exception.
>>> from lxml import etree
>>> parser = etree.XMLParser()
>>> xml = etree.parse("entity.xml", parser)
Ok, no exception here, so what happened?
>>> print(parser.error_log)
entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined
So, libxml2 did find the missing entity and reported the error to lxml. I
looked into it and it seems that the parser continued parsing and returned a
document containing the entity reference, saying that it was well formed.
Therefore, lxml did not raise an exception. When you serialise the document
after parsing, you will see that the entity reference is still in there, so
this actually works. However, when you print the ".text" of the element
containing the entity reference, it is not printed, so you can see that it is
not passed on to the API level.
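For reference, here is a minimal sketch of checking the parser's error log whether or not an exception is raised. The inline document is a made-up stand-in for entity.xml and the helper name collect_parser_messages is hypothetical. Depending on the lxml/libxml2 version and on whether the document carries a DTD, the parse may raise XMLSyntaxError instead of passing silently, so the sketch handles both cases.

```python
from lxml import etree

# Hypothetical stand-in for entity.xml: it references &oneXXX; without declaring it.
XML_DATA = b"<root><item>before &oneXXX; after</item></root>"


def collect_parser_messages(data):
    """Parse data; return (root element or None, list of error log messages)."""
    parser = etree.XMLParser()
    try:
        root = etree.fromstring(data, parser)
    except etree.XMLSyntaxError:
        # Some setups reject the undeclared entity outright.
        root = None
    # The error log of the last parser run is available either way.
    messages = [entry.message for entry in parser.error_log]
    return root, messages


root, messages = collect_parser_messages(XML_DATA)
for message in messages:
    print(message)
```

The point is only that parser.error_log is worth inspecting even when no exception shows up.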
Given the normal lxml behaviour of resolving entities and not supporting them
at the API level at all, I would call this a bug.
However, it is not obvious how to deal with this. I mean, entity references
currently pass in and out rather nicely, they are just not visible. So raising
an error here would likely break some existing code that does not explicitly
load DTDs to resolve them but relies on lxml's current behaviour of passing
them through.
On the other hand, there is no easy way to support them at the API level, as
they can occur anywhere in text content. I mean, how should lxml distinguish
between a user passing in the entity "&entity;" and someone who just passes in
the text "In XML, entities are written as &entity;", expecting that it gets
properly escaped like any other text in lxml.etree? So it is better to raise
an error here than to have users deal with entities.
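To illustrate the ambiguity: text set through the API is escaped on serialisation, so a literal "&entity;" in a string never turns into an entity reference. A small sketch, assuming nothing beyond the ElementTree-style API; it falls back to the standard library's xml.etree.ElementTree if lxml is not installed, and the element name "p" is arbitrary.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree behaves the same for this example.
    import xml.etree.ElementTree as etree

el = etree.Element("p")
el.text = "In XML, entities are written as &entity;"

# The ampersand is escaped, so the output contains "&amp;entity;".
serialised = etree.tostring(el, encoding="unicode")
print(serialised)

# Parsing the serialised form gives back exactly the text that was set.
roundtrip = etree.fromstring(serialised)
print(roundtrip.text == el.text)
```

If ".text" also accepted real entity references, the API would need some way to tell these two cases apart.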
Any opinions on this?
Stefan