[lxml-dev] Entity handling in lxml
Hi all,

let's make this a new thread to discuss the topic that was raised by Eric Garin.

The parsers in lxml are currently configured to replace entity references (&entity;) by their definition. This requires a DTD, either inside the document, as an external URL reference, or from the system catalog. By default, the parsers currently neither load DTDs nor validate.

So, the current situation is:

1) If you use the default parser, all entities will pass through without raising an exception, but will put an error message in the error_log:

       entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined

   They will not be visible at the API level, and they will cut off text that contains them ("my &entity; value" will result in a text property value of "my "), but they will be serialised correctly. They may also break a lot of things internally, as the implementation is not prepared for dealing with things like entity reference nodes.

2) If you configure a parser to load the DTD, declared entities will be replaced and undeclared entities will behave as above.

3) If you configure a parser to validate against a DTD, it will still behave exactly as above.

This behaviour is definitely a bug. It would be cleaner to do this:

1) The default parser should replace internally defined entities and report all other entities as an error.

2) A parser that loads the DTD should report undeclared entities as an error (although it would not do any validation).

3) A validating parser should report undeclared entities as an error, just like any other structural or semantic deviation from the DTD.

The alternative would be to provide an API for entities and to rewrite the internals to deal with them somehow. We could potentially make entity references a sort of element that behaves more or less like a comment. Entities would mainly have a name and a tail.
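To make the "name and tail" idea concrete, here is a minimal sketch in plain Python. It is not lxml code: the `EntityRef` and `Element` classes and the `serialise` helper are hypothetical names invented for illustration, and the sketch assumes that text following a reference would be stored on the entity node's tail, the way text following a comment is.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union
from xml.sax.saxutils import escape


@dataclass
class EntityRef:
    """Hypothetical entity reference node: just a name and a tail."""
    name: str
    tail: Optional[str] = None


@dataclass
class Element:
    """Hypothetical, heavily simplified element node."""
    tag: str
    text: Optional[str] = None
    tail: Optional[str] = None
    children: List[Union["Element", EntityRef]] = field(default_factory=list)


def serialise(node: Union[Element, EntityRef]) -> str:
    """Write a node back out, keeping entity references unexpanded."""
    if isinstance(node, EntityRef):
        out = f"&{node.name};"
    else:
        inner = escape(node.text or "")
        inner += "".join(serialise(child) for child in node.children)
        out = f"<{node.tag}>{inner}</{node.tag}>"
    # The tail is the text that follows the node inside its parent.
    return out + escape(node.tail or "")


# "my &entity; value" modelled as: text, entity node, entity tail
root = Element("root", text="my ",
               children=[EntityRef("entity", tail=" value")])

print(repr(root.text))   # 'my '
print(serialise(root))   # <root>my &entity; value</root>
```

In this model the element's text stops at the reference, and the remainder only becomes reachable once the entity node itself is exposed as a child.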
We would then need an Entity() factory and would have to integrate entity reference nodes into the internal traversal code (basically: let _isElement(c_entity_node) return 1).

When would they appear in the tree? We would additionally need a "resolve_entities" keyword argument for the parsers; that would be the easiest way to deal with this. If it is set, unresolvable entities will result in an error as described above. Otherwise, entity references will not be replaced.

Any comments?

Stefan
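For comparison, the "resolve or report an error" behaviour proposed for the default parser can be observed in Python's standard-library xml.etree.ElementTree, which is used here purely as a stand-in. This is not lxml and implies nothing about how lxml would implement it; the `text_or_error` helper and the sample documents are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Entity declared in the internal DTD subset of the document.
DECLARED = (
    '<!DOCTYPE root [<!ENTITY entity "replaced">]>'
    '<root>my &entity; value</root>'
)

# No DTD at all, so the entity cannot be resolved.
UNDECLARED = '<root>my &oneXXX; value</root>'


def text_or_error(xml: str) -> str:
    """Return the root element's text, or a short error description."""
    try:
        return ET.fromstring(xml).text or ""
    except ET.ParseError as exc:
        # Undeclared entities are reported instead of passing through.
        return f"error: {exc}"


print(text_or_error(DECLARED))    # my replaced value
print(text_or_error(UNDECLARED))  # an "error: ..." message
```

The point of the comparison is only that an unresolvable reference surfaces as an error at parse time rather than silently truncating the text.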
participants (1)
-
Stefan Behnel