[XML-SIG] Preserving XML and DocType declaration attributes using DOM

Dinu Gherman gherman@darwin.in-berlin.de
Wed, 20 Mar 2002 10:10:14 +0100 (CET)


"Martin v. Loewis" <martin@v.loewis.de>:

> To my knowledge, the DOM, as specified, does not support this kind of
> operation (at least not in DOM level 2). Neither does any of the Python
> DOM implementations provide this as an extension.

I read about some limits, but I'm surprised to see that I don't seem to
have access to the systemId and publicId attributes (in my example I had
expected systemId="doc.dtd", but got ""):

    # doc contains this: <!DOCTYPE document SYSTEM "doc.dtd">
    doc = FromXmlStream(path)
    dt = doc.doctype

    print dt.name, dt.systemId, dt.publicId
    # gives: document  None   (systemId is the empty string)
    print map(type, (dt.name, dt.systemId, dt.publicId))
    # gives: [<type 'unicode'>, <type 'str'>, <type 'NoneType'>]
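
For comparison, a minimal sketch of the same lookup in present-day
Python 3 with the standard-library xml.dom.minidom. This is an
assumption for illustration, not the PyXML reader used above; the
inline string stands in for the file, and the external DTD is not
fetched.

```python
from xml.dom.minidom import parseString

# Same declaration as in the example above, given inline instead of
# being read from a file.
source = '<!DOCTYPE document SYSTEM "doc.dtd"><document/>'

doc = parseString(source)
dt = doc.doctype

# The DocumentType node carries the root element name and both identifiers.
print(dt.name, dt.systemId, dt.publicId)
```

Here minidom reports the system identifier as written in the
declaration; a missing public identifier comes back as a false value.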


> If you happen to know what the document type declaration should have
> been in the document, you can easily write it back out when printing
> the document.
> 
> If you need roundtrip support for any kind of document type, you had
> best select a parser that both
> a) passes document type fragments to the application, and
> b) can be used to build a DOM tree.
> 
> You would then need to hook into the DOM building process, forking the
> DTD data into a separate object.


I thought of using a second SAX parse to locate the position of the
document root element and then prefix the serialized DOM with everything
that precedes it. After an hour of fiddling with all sorts of handlers
I gave up and wrote the following very pragmatic function, which does
what I want:

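def getXmlAndDoctypeDeclaration(path, rootElementName="document"):
    """Extract XML and DOCTYPE declaration header from an XML file.

    Uses a *very* pragmatic approach ignoring all available PyXML
    goodies...  Note that a plain text search is used, so a root
    element name that is a prefix of another tag name, or that shows
    up inside a comment before the root element, will fool it.
    """
    elem = "<%s" % rootElementName
    if type(elem) != type(''):
        # assume Unicode
        elem = elem.encode('ascii')

    # Read increasingly large prefixes of the file; -1 reads all of it.
    pos = -1
    for length in (100, 1000, 10000, -1):
        f = open(path)
        try:
            buff = f.read(length)
        finally:
            f.close()
        pos = buff.find(elem)
        if pos >= 0:
            break

    if pos < 0:
        # Raise instead of assert, so the check survives "python -O".
        msg = "Root element named '%s' not found!"
        raise ValueError(msg % rootElementName)
    return buff[:pos].strip()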
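
A sketch of how such an extracted header might be put back in front of
the serialized DOM. It is written for present-day Python 3 and
xml.dom.minidom rather than PyXML; read_header and roundtrip are
hypothetical names, and the file is assumed to be UTF-8 encoded.

```python
import os
import tempfile
from xml.dom.minidom import parse


def read_header(path, root_name="document"):
    """Return the text before the root element's start tag (hypothetical helper)."""
    # Assumption: the file is UTF-8; a real tool would honour the declared encoding.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    pos = text.find("<" + root_name)
    if pos < 0:
        raise ValueError("Root element named %r not found!" % root_name)
    return text[:pos].strip()


def roundtrip(path, root_name="document"):
    """Serialize the parsed root element with the original header in front."""
    header = read_header(path, root_name)
    doc = parse(path)
    try:
        body = doc.documentElement.toxml()
    finally:
        doc.unlink()
    return header + "\n" + body


# Usage: a throwaway file standing in for a real document.
sample = ('<?xml version="1.0" encoding="utf-8" standalone="no"?>\n'
          '<!DOCTYPE document SYSTEM "doc.dtd">\n'
          '<document><title>Hello</title></document>\n')
fd, tmp_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(sample)
try:
    result = roundtrip(tmp_path)
finally:
    os.remove(tmp_path)
print(result)
```

Only the root element is serialized from the DOM here, so comments or
processing instructions that follow it in the file would be lost.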


> Notice that, in general, this is a tricky problem: DTDs are *very*
> expressive, with conditional statements etc, so that an object
> representing the full grammatical structure of a DTD would be quite
> sophisticated.
> 
> If you know that there will never be an internal DTD subset, the
> problem is simplified significantly, as you only need to store public
> and system identifiers, and root element name.

Yes, this is the case for me. But I'll also store the XML declaration
attributes (version, encoding and standalone)...
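
A minimal sketch of collecting exactly these pieces with the
declaration handlers of xml.parsers.expat in present-day Python 3,
assuming no internal DTD subset. read_prolog and format_prolog are
hypothetical helper names, and identifiers containing double quotes
are not handled.

```python
from xml.parsers import expat


def read_prolog(data):
    """Collect XML declaration and DOCTYPE details from XML bytes."""
    info = {}

    def xml_decl(version, encoding, standalone):
        # expat reports standalone as 1 (yes), 0 (no) or -1 (omitted).
        info["version"] = version
        info["encoding"] = encoding
        info["standalone"] = {1: "yes", 0: "no"}.get(standalone)

    def start_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name
        info["systemId"] = system_id
        info["publicId"] = public_id
        info["has_internal_subset"] = bool(has_internal_subset)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = xml_decl
    parser.StartDoctypeDeclHandler = start_doctype
    parser.Parse(data, True)
    return info


def format_prolog(info):
    """Rebuild the declarations; only meaningful without an internal subset."""
    lines = []
    if "version" in info:
        decl = '<?xml version="%s"' % info["version"]
        if info["encoding"]:
            decl += ' encoding="%s"' % info["encoding"]
        if info["standalone"]:
            decl += ' standalone="%s"' % info["standalone"]
        lines.append(decl + "?>")
    if "root" in info:
        doctype = "<!DOCTYPE %s" % info["root"]
        if info["publicId"]:
            doctype += ' PUBLIC "%s" "%s"' % (info["publicId"], info["systemId"])
        elif info["systemId"]:
            doctype += ' SYSTEM "%s"' % info["systemId"]
        lines.append(doctype + ">")
    return "\n".join(lines)


data = (b'<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>\n'
        b'<!DOCTYPE document SYSTEM "doc.dtd">\n'
        b'<document/>')
info = read_prolog(data)
print(format_prolog(info))
```

The has_internal_subset flag lets a caller notice when the simple case
does not apply and the rebuilt DOCTYPE would drop declarations.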

Thanks again!

Dinu