How to best deal with HTML entities in an XML file?
Hello, I received a bunch of XML files that contain HTML entities (so far, I’ve seen only used). I can’t parse these files with an XML parser because of these HTML entities:
The `resolve_entities` parameter for XMLParser unfortunately doesn’t seem to resolve HTML entities. If I parse the file using an HTMLParser it works:
but then the upper/lower case of all tags is lost because HTML is case-insensitive (XML is not) and it seems that the HTML parser turns all tag names to lower case:
    xml.getroot().find("body/*")
    <Element docxml at 0x1059cb730>
This should be a `DocXML` tag name. Now my original XML file is broken and fails schema validation… So, what now?

I feel very hesitant to treat the original XML file as a string and replace HTML entities (except & < >) on a string level. I think a better approach would be to make the XML parser aware of HTML entities, but that may be a libxml2 issue rather than an lxml one? (Haven’t looked at the source yet.)

Would you have any other recommendations? How else could I work with this issue?

Much thanks!
Jens

-- 
Jens Tröger
https://savage.light-speed.de/
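For reference, the string-level replacement mentioned above could look roughly like the sketch below. It is not from the thread: the function name `html_entities_to_char_refs` is hypothetical, and the sketch assumes that Python's standard `html.entities.name2codepoint` table covers the entities in the files. It rewrites named HTML entities as numeric character references, which any XML parser accepts, and leaves the predefined XML entities alone.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &nbsp; (not numeric ones like &#160;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_char_refs(text):
    """Replace named HTML entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        # Keep XML's own entities and anything the HTML table does not know.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return _ENTITY_RE.sub(replace, text)


converted = html_entities_to_char_refs("<DocXML>a&nbsp;b &amp; c</DocXML>")
```

The regular expression has no notion of XML structure, so it would also rewrite matches inside CDATA sections and comments. That is one reason to treat this only as a fallback.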
Jens Tröger wrote on 26.03.21 at 08:34:
`resolve_entities` is on by default. You can only disable it, in which case the entities do not get resolved and stay in the tree. That makes the processing a bit more tedious, but it allows passing the entities through into the output as they came in.
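A minimal sketch of the difference, assuming a hypothetical document that declares `nbsp` in its internal DTD subset (the real files were not shown in the thread). With `resolve_entities=False` the reference stays in the tree as an entity node between the surrounding text parts and is serialised unchanged.

```python
from lxml import etree

# Hypothetical sample; the entity is declared inline so the example is self-contained.
XML = b'<!DOCTYPE DocXML [<!ENTITY nbsp "&#160;">]><DocXML>a&nbsp;b</DocXML>'

# Entity references are replaced by their text value.
resolved = etree.fromstring(XML, etree.XMLParser(resolve_entities=True))

# Entity references are kept as separate nodes in the tree.
kept = etree.fromstring(XML, etree.XMLParser(resolve_entities=False))
entity = kept[0]  # the &nbsp; reference, sitting between "a" and "b"

serialised = etree.tostring(kept)
```

The "more tedious" part is visible here: the text of `kept` is split into `kept.text`, the entity node, and the entity's `tail`, so code that reads text has to handle entity nodes explicitly.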
If the XML contains entities, then it probably starts with a DOCTYPE declaration. That would refer to a DTD that defines the entities. If that's the case, then `load_dtd=True` would tell the parser to read the entity definitions from the DTD, so that they can be resolved. Ideally, configure your locally installed catalogues to include that DTD, so that it won't have to be loaded from the network on each use.

Stefan
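As a rough illustration of `load_dtd=True`, the sketch below writes a hypothetical one-line DTD and a matching document into a temporary directory and parses the document from there. The file names, the DTD content and the helper `parse_with_local_dtd` are assumptions made for the example; the actual DTD from the thread was not shown.

```python
import os
import tempfile

from lxml import etree

# Hypothetical DTD that only declares the entity used below.
DTD_TEXT = '<!ENTITY nbsp "&#160;">\n'

# Hypothetical document whose DOCTYPE points at the DTD by relative path.
XML_TEXT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE DocXML SYSTEM "some.dtd">\n'
    '<DocXML>a&nbsp;b</DocXML>\n'
)


def parse_with_local_dtd():
    """Parse the sample document, reading entity definitions from its DTD."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "some.dtd"), "w", encoding="utf-8") as f:
            f.write(DTD_TEXT)
        xml_path = os.path.join(tmp, "test.xml")
        with open(xml_path, "w", encoding="utf-8") as f:
            f.write(XML_TEXT)
        # load_dtd=True reads the DTD; no validation is requested here.
        parser = etree.XMLParser(load_dtd=True)
        return etree.parse(xml_path, parser).getroot()


root = parse_with_local_dtd()
```

Because the tag names go through the XML parser, `DocXML` keeps its case, unlike with `HTMLParser`.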
Hello,
> If the XML contains entities, then it probably starts with a DOCTYPE declaration.
After the <?xml> declaration, yes.
I tried with `load_dtd=True` and `dtd_validation=True` and received this error:

    lxml.etree.XMLSyntaxError: failed to load external entity "https://some.domain/xml/dtd/some.dtd", line 2, column 97

although that file exists and lxml should be able to access the network. That error sent me on the goose chase which triggered my initial email…
Oh, I wasn’t aware of the catalogues and resolvers (https://lxml.de/resolvers.html), that’s great to know! What I tried now is this:

    import lxml.etree

    class DTDResolver(lxml.etree.Resolver):
        def resolve(self, url, id, context):
            # Serve the remote DTD from a local copy.
            if url == "https://some.domain/xml/dtd/some.dtd":
                return self.resolve_filename("/path/to/local/some.dtd", context)
            return None

    parser = lxml.etree.XMLParser(huge_tree=True, dtd_validation=True, load_dtd=True)
    parser.resolvers.add(DTDResolver())
    lxml.etree.parse("test.xml", parser)

This loads the XML, but I then get an error:

    lxml.etree.XMLSyntaxError: Content model of div is not determinist: ((argument | byline … ))

which is independent of the original problem of resolving the entities and loading the XML. I can read the XML file by loading the DTD and disabling validation with `dtd_validation=False`. Not pretty, and it needs to be resolved (pun intended) by the document owners, but this unblocks me. Looks like this is the proper way to go about this.

Much thanks!
Jens

-- 
Jens Tröger
https://savage.light-speed.de/
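A self-contained variant of that final setup (DTD loaded through a resolver, validation off) might look like the sketch below. It serves a hypothetical one-line DTD from memory with `resolve_string` instead of a local file, so it runs without the real `some.dtd`. The DTD content, the sample document and the names `InMemoryDTDResolver` and `parse_without_validation` are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical stand-ins for the real DTD and document, which were not shown.
DTD_URL = "https://some.domain/xml/dtd/some.dtd"
DTD_TEXT = '<!ENTITY nbsp "&#160;">'
XML_TEXT = (
    '<!DOCTYPE DocXML SYSTEM "{}">'
    '<DocXML>a&nbsp;b</DocXML>'
).format(DTD_URL).encode("utf-8")


class InMemoryDTDResolver(etree.Resolver):
    """Answer requests for the remote DTD without touching the network."""

    def resolve(self, url, pubid, context):
        if url == DTD_URL:
            return self.resolve_string(DTD_TEXT, context)
        return None  # let other resolvers or the default loader handle it


def parse_without_validation(data):
    # Load the DTD for its entity definitions, but skip DTD validation.
    parser = etree.XMLParser(load_dtd=True, dtd_validation=False)
    parser.resolvers.add(InMemoryDTDResolver())
    return etree.fromstring(data, parser)


root = parse_without_validation(XML_TEXT)
```

Returning `None` for unknown URLs keeps the resolver narrow: only the one DTD is redirected, and everything else falls through to the parser's normal lookup.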
participants (2)
- Jens Tröger
- Stefan Behnel