[XML-SIG] Status of XML 1.1 processing in Python?

Wed Aug 31 00:57:51 CEST 2005

I wrote:

>> In a few sentences, could some kind soul summarize the
>> status of XML 1.1 processing using Python XML modules?
>
> I haven't done any extensive testing, but I'm quite sure that sgmlop
> 1.1 supports it.

fwiw, as the following snippet illustrates, ET+sgmlop can read files with
1.1-style character references, but the ET serializer doesn't encode such
characters on the way out.  this script

    from elementtree import ElementTree, SgmlopXMLTreeBuilder
    from StringIO import StringIO

    file = StringIO("<test>this is a backspace: &#x0008;</test>")

    doc = ElementTree.parse(file, SgmlopXMLTreeBuilder.TreeBuilder())

    root = doc.getroot()

    print repr(root.text)
    print repr(ElementTree.tostring(root))

prints

    'this is a backspace: \x08'
    '<test>this is a backspace: \x08</test>'

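which isn't entirely correct: the backspace is written out as a raw
control character, and XML 1.1 only allows such characters in the form
of character references.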
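to see why this matters, here's a small check that feeds the serialized
string back to a parser.  this is only a sketch, and it assumes a current
Python, where ElementTree ships in the standard library as
xml.etree.ElementTree (that's not the setup used in the script above).  the
names "serialized" and "rejected" are made up for the example.

```python
import xml.etree.ElementTree as ET

# the string the serializer produced above, raw backspace included
serialized = "<test>this is a backspace: \x08</test>"

try:
    ET.fromstring(serialized)
except ET.ParseError as exc:
    # the expat-based parser refuses the raw control character
    rejected = True
    print("rejected:", exc)
else:
    rejected = False
    print("accepted")
```

the parser raises ParseError here, so the output can't be read back as it
stands.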

fixing this in ElementTree is pretty straightforward; just tweak the
RE, and make sure _encode_entity is called for all cdata sections.
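as a rough illustration of the idea (not the actual ElementTree code), the
escaping step could look something like this.  encode_entity, ESCAPE and
NAMED are hypothetical names for this sketch, and it's written for a current
Python, not the 2.x interpreter used elsewhere in this message.

```python
import re

# markup characters, control characters (except LF and CR), and
# everything from U+0080 to U+FFFF
ESCAPE = re.compile('[&<>"\x01-\x09\x0b\x0c\x0e-\x1f\u0080-\uffff]+')

# characters that have a predefined entity
NAMED = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}

def encode_entity(text, pattern=ESCAPE):
    """Replace every matching character with an entity or a
    numeric character reference."""
    def escape_run(match):
        out = []
        for char in match.group():
            out.append(NAMED.get(char) or "&#%d;" % ord(char))
        return "".join(out)
    return pattern.sub(escape_run, text)

print(encode_entity("this is a backspace: \x08"))
# this is a backspace: &#8;
```

note that a reference like &#8; is only legal in a document that's declared
as XML 1.1; an XML 1.0 parser will still reject it.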

you can also use the following brute-force runtime patch:

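    # patch the ET serializer (works with 1.2.X, may break beyond that)
    import re
    from elementtree import ElementTree
    # also match control characters (but not LF and CR), so they are
    # written out as character references
    escape = re.compile(u'[&<>\"\x01-\x09\x0b\x0c\x0e-\x1f\u0080-\uffff]+')
    # make _encode_entity use the new pattern by default
    ElementTree._encode_entity.func_defaults = (escape,)
    # route all character data through _encode_entity
    ElementTree._escape_cdata = lambda a, b: ElementTree._encode_entity(a)
    # end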

</F>