[XML-SIG] Bug with XML file having a doctype declaration

Wed, 26 Mar 2003 10:30:27 +0100

Consider the following test program (you will probably recognize it from
a recently fixed bug ...)

--------------><---------------------><--------------------------
#! /usr/bin/env python

import xml.dom
from xml.dom.ext.reader import Sax2

reader = Sax2.Reader()
doc    = reader.fromString("""<?xml version="1.0" ?>
<!DOCTYPE kasten PUBLIC "-//Jochen Voss//DTD Zettel 1.0//EN" "zettel.dtd">
<kasten>
</kasten>
""")
for c in doc.childNodes:
    if c.nodeType==xml.dom.Node.DOCUMENT_TYPE_NODE:
        print "public ID: "+c.publicId
        print "system ID: "+c.systemId
--------------><---------------------><--------------------------

When I run this with PyXML-0.8.2 I get a stack trace:
Traceback (most recent call last):
  File "./saxtest.py", line 21, in ?
    doc=reader.fromString("""<?xml version="1.0" ?>
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/__init__.py", line 61, in fromString
    return self.fromStream(stream, ownerDoc)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/Sax2.py", line 373, in fromStream
    self.parser.parse(s)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 379, in external_entity_ref
    self._source.getSystemId() or
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/saxutils.py", line 515, in prepare_input_source
    f = urllib2.urlopen(source.getSystemId())
  File "/usr/lib/python2.2/urllib2.py", line 138, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.2/urllib2.py", line 320, in open
    type_ = req.get_type()
  File "/usr/lib/python2.2/urllib2.py", line 224, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: zettel.dtd

From this I concluded that the parser wanted to parse the external
subset of my doctype declaration. As I don't need any DTD details within
my application, I'd like to be able to simply keep that doctypedecl node
without parsing it further.
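
The last frames of the traceback can be reproduced in isolation. The sketch below is an assumption-laden stand-in: it uses Python 3's `urllib.request` rather than the `urllib2` module from the traceback, and `open_error` is a hypothetical helper name. It only shows that a system ID without a URL scheme, such as `zettel.dtd`, is rejected before any file or network access happens.

```python
import urllib.request


def open_error(system_id):
    """Return the ValueError message urlopen raises for system_id.

    Hypothetical helper for illustration; returns None if no ValueError
    is raised. Only call it with scheme-less IDs, since a real URL would
    trigger an actual request.
    """
    try:
        urllib.request.urlopen(system_id).close()
    except ValueError as exc:
        return str(exc)
    return None


# "zettel.dtd" has no scheme (no "file:", "http:", ...), so the URL
# machinery refuses it while building the request object.
message = open_error("zettel.dtd")
print(message)
```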

After digging around in the sources for expatreader I came up with the
following workaround:
--------------><---------------------><--------------------------
#! /usr/bin/env python

from xml.sax.expatreader import \
     ExpatParser, \
     expat

class pyExpatWrapper(ExpatParser):
    """
    Wrapper for the ExpatParser that prevents any attempt to parse the
    external subset of the DOCTYPE declaration.
    """
    def reset(self):
        ExpatParser.reset(self)
        self._parser.SetParamEntityParsing (
            expat.XML_PARAM_ENTITY_PARSING_NEVER)
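
import xml.dom
from xml.dom.ext.reader import Sax2

# Hand the wrapper to the reader; with no parser argument Sax2.Reader
# creates its own default parser and the class above is never used.
reader = Sax2.Reader(parser=pyExpatWrapper())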
doc    = reader.fromString("""<?xml version="1.0" ?>
<!DOCTYPE kasten PUBLIC "-//Jochen Voss//DTD Zettel 1.0//EN" "zettel.dtd">
<kasten>
</kasten>
""")
for c in doc.childNodes:
    if c.nodeType==xml.dom.Node.DOCUMENT_TYPE_NODE:
        print "public ID: "+c.publicId
        print "system ID: "+c.systemId
--------------><---------------------><--------------------------
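
For comparison, here is a minimal sketch of the same idea against the Python 3 standard library instead of PyXML. It is untested against PyXML itself, and the names `NoExternalSubsetParser`, `DoctypeRecorder` and `read_doctype` are hypothetical. Like the workaround above, it relies on the private `_parser` attribute of `ExpatParser`, so it stays tied to the expat driver. The doctype IDs are collected through the SAX lexical-handler property rather than a DOM tree.

```python
import io
from xml.parsers import expat
from xml.sax.expatreader import ExpatParser
from xml.sax.handler import property_lexical_handler

XML = """<?xml version="1.0" ?>
<!DOCTYPE kasten PUBLIC "-//Jochen Voss//DTD Zettel 1.0//EN" "zettel.dtd">
<kasten>
</kasten>
"""


class NoExternalSubsetParser(ExpatParser):
    """Expat SAX driver that never parses parameter entities,
    and therefore never reads the external DTD subset."""

    def reset(self):
        ExpatParser.reset(self)
        # Assumption: _parser is a private attribute of the expat driver.
        self._parser.SetParamEntityParsing(
            expat.XML_PARAM_ENTITY_PARSING_NEVER)


class DoctypeRecorder:
    """Lexical handler that remembers the IDs of the DOCTYPE declaration."""

    def __init__(self):
        self.public_id = None
        self.system_id = None

    def startDTD(self, name, public_id, system_id):
        self.public_id = public_id
        self.system_id = system_id

    # The expat driver wires up all lexical callbacks, so the unused
    # ones are defined as no-ops.
    def endDTD(self):
        pass

    def comment(self, content):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass


def read_doctype(text):
    """Parse text and return (public ID, system ID) of its DOCTYPE."""
    parser = NoExternalSubsetParser()
    recorder = DoctypeRecorder()
    parser.setProperty(property_lexical_handler, recorder)
    parser.parse(io.StringIO(text))
    return recorder.public_id, recorder.system_id


public_id, system_id = read_doctype(XML)
print("public ID: " + public_id)
print("system ID: " + system_id)
```

Recent Python 3 releases already leave external general entities switched off in the default SAX driver, so the `reset` override may well be redundant there; it is kept only to mirror the workaround.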

That way everything works
[at least with PyXML-0.8.2; with PyXML-0.7.1 I get
public ID:
system ID:
]

Is that the way it's meant to be done? Or is there an easier, less
parser-dependent way to achieve my goal?

Cheers,
Gottfried