[XML-SIG] Using character entities in external DTD without validating.

Alan Kennedy pyxml@xhaus.com
Mon, 9 Apr 2001 18:56:58 +0100


Hi all,

Firstly, thanks for the great XML software.

I have a small problem. I have several hundred XML files, which contain data
for the members of a scientific association, as well as research abstracts.
I have used a variety of XML tools over the years to process them, including
Java tools with JPython.

But now I want to use CPython, for speed and memory reasons.

The problem I have is this. Many of the files contain character entity
references, which refer to ISO-8859-1 characters, as well as other
characters such as Greek letters (alpha, mu, etc.).

The entity references are defined in one central DTD file, which every
single XML file refers to using a DOCTYPE declaration. But I do not have an
actual declared structure for the XML files themselves; their structure is
pretty random and has grown over the years.

I first started my CPython/PyXML port by trying to use PyExpat. However,
since PyExpat doesn't read the external subset, it dropped all my character
entity references.
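
To illustrate, here is a minimal reproduction. It is written against the
standard-library xml.parsers.expat module rather than the PyXML build I am
actually using, so treat it as a sketch and not as exactly what I ran. The
sample document, the "entities.dtd" name and the helper function are all made
up for the example.

```python
import xml.parsers.expat as expat

# Made-up sample: the entities are declared only in the external "entities.dtd",
# which a default expat parser never opens.
DOC = (b'<?xml version="1.0"?>\n'
       b'<!DOCTYPE doc SYSTEM "entities.dtd">\n'
       b'<doc>caf&eacute; &alpha;</doc>')


def parse_ignoring_external_subset(doc):
    """Parse with a default expat parser; return (text, skipped entity names)."""
    chunks, skipped = [], []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    # Called for entity references expat cannot resolve because the
    # external subset was not read.
    parser.SkippedEntityHandler = lambda name, is_param: skipped.append(name)
    parser.Parse(doc, True)
    return "".join(chunks), skipped


text, skipped = parse_ignoring_external_subset(DOC)
print(repr(text), skipped)
```

The entity names end up in the skipped list and never reach the character
data, which is the behaviour I mean by "dropped".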

Then I tried the Sax2.Reader from xml.dom.ext.reader. This reads the
external subset when the validate flag is turned on (i.e. the reader is
instantiated like so: "reader=Sax2.Reader(validate=1)").

But now the Sax2.Reader is, understandably, insisting that my documents
conform to a structure, which they don't, so I get errors such as "Element
<XYZ> not declared".

Can anyone suggest a way that I can keep the character entity definitions in
an external file, AND read the documents without validating them?
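
To make the goal concrete, here is a rough sketch of the behaviour I am after,
using only the plain expat bindings from the standard library
(xml.parsers.expat). It asks expat to load the external subset through an
external entity handler, and expat itself never validates. I do not know how
to get the same effect through the PyXML DOM readers, which is really my
question. The DTD content, the load_dtd() helper and the function name are
hypothetical stand-ins; a real loader would open the file named by the system
identifier.

```python
import xml.parsers.expat as expat

# Hypothetical stand-ins for the central DTD and one of the documents.
DTD = b'<!ENTITY eacute "&#233;">\n<!ENTITY alpha "&#945;">\n'
DOC = (b'<?xml version="1.0"?>\n'
       b'<!DOCTYPE doc SYSTEM "entities.dtd">\n'
       b'<doc>caf&eacute; &alpha;</doc>')


def load_dtd(system_id):
    """Stand-in loader: a real one would read the file named by system_id."""
    return DTD


def parse_with_external_entities(doc, load=load_dtd):
    """Return the document text with externally declared entities expanded."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to process the external subset; this does not enable validation.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_entity_ref(context, base, system_id, public_id):
        # Feed the DTD bytes to a sub-parser that shares the parent's state.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(load(system_id), True)
        return 1  # non-zero tells expat to carry on parsing

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.CharacterDataHandler = chunks.append
    parser.Parse(doc, True)
    return "".join(chunks)


print(parse_with_external_entities(DOC))
```

Note that the <doc> element is not declared anywhere in the DTD, and the parse
still goes through, which is exactly the combination I need.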

I considered converting all of the documents to ISO-8859-1 encoding, but
that doesn't solve the problem of the Greek letters in paper abstracts. I
really don't want to have to define those character entities in the internal
subset of all these documents.

Thanks in advance for any help,

Regards,

Alan.