nonstandard XML character entities?
Paul Rubin
Sat Apr 14 15:44:43 EDT 2007
"Martin v. Löwis" <martin at v.loewis.de> writes:
> If they contain such things, and do not contain a document type
> definition, they are not well-formed XML files (i.e. can't be
> called "XML" in a meaningful sense).
The documents do have a DTD; however, the DTD file doesn't say anything
about these entities.
> It would have been helpful if you had given an example of such
> a document.
I can't post a whole document because these docs are very large
and I'm not sure that the data is public. It does look like the DTD
is public: the document begins with
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
...
and that URL points to the DTD, which is online.
Basically the doc has elements like
<b036>Diana Montan&eacute;</b036>
and both ElementTree and xmllint complain about the character entities
(and there are a lot of them).
> If there is a document type declaration in the document, the best
> way is to parse it in a mode where the parser downloads the DTD
> when parsing it, and resolves the entity references itself.
Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the
DTD but nothing about those character entities--am I looking in
the right place?
> In ElementTree, the XMLTreeBuilder has an attribute entity
> which is a dictionary used to map entity names in entity references
> to their definitions. Whether you can make the parser download
> the DTD itself, I don't know.
Chuck Rhode posted some code for something like this so I'll try it
on Monday.
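
For reference, the entity-dictionary approach Martin describes might look
roughly like the sketch below. It is not Chuck Rhode's code. It assumes a
current Python 3, where the parser class is xml.etree.ElementTree.XMLParser
rather than XMLTreeBuilder, and it assumes the undefined references use
HTML-style names such as &eacute;, so html.entities is only a stand-in
table. The sample document is made up from the fragments quoted above, and
parse_with_entities is a hypothetical helper name.

```python
import html.entities
import xml.etree.ElementTree as ET

# Made-up sample assembled from the fragments quoted in this post;
# it is not a real ONIX record.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM '
    b'"http://www.editeur.org/onix/2.1/short/onix-international.dtd">\n'
    b'<ONIXmessage release="2.1">'
    b'<b036>Diana Montan&eacute;</b036>'
    b'</ONIXmessage>'
)


def parse_with_entities(data):
    """Parse XML bytes, resolving named entities from a local table.

    Hypothetical helper: the table is HTML's named entities, which is an
    assumption about what the documents use, not something the DTD confirms.
    """
    parser = ET.XMLParser()
    # parser.entity maps entity names to replacement text. The parser
    # consults it for references that no declaration it has seen defines.
    for name, codepoint in html.entities.name2codepoint.items():
        parser.entity[name] = chr(codepoint)
    return ET.fromstring(data, parser=parser)


root = parse_with_entities(SAMPLE)
print(root.find("b036").text)
```

The DTD itself is never downloaded here. As far as I understand it, the
lookup in parser.entity only happens because the document declares an
external DTD; without a DOCTYPE, the parser reports an undefined entity
before the dictionary is consulted.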
Thanks!