[XML-SIG] pulldom with XML 1.1 problem

Sat Aug 28 14:44:21 CEST 2004

                 Newbie problem:  pulldom with XML 1.1

The Question: 
    How can I make pulldom parse according to XML 1.1 conventions?
    Or:  Is there an upgrade of pulldom that handles XML 1.1?
    Or:  Is there some other XML 1.1 parsing solution in Python?

Background:  I'm running
Python 2.3.3 (#1, Feb 17 2004, 11:48:35)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2

Illustration of my problem:

I start with the following simple xml file, call it test.xml

<?xml version="1.0" encoding="utf-8"?>

<foo>
  <bar>first line of text</bar>
  <bar>second line of text</bar>
  <bar>third line of text</bar>
  <bar>&#x0061;&#x0062;&#x0063;</bar>
</foo>

and the following Relax NG schema (compact syntax), call it test.rng

grammar {
  start = element foo {
    element bar {text}+
  }
}

Validation of test.xml succeeds using the Jing validating parser:

java -jar jing.jar -c test.rng test.xml

So far so good.

****** Now for XML 1.0 vs. XML 1.1 ...

In XML 1.0, all characters below x20 are invalid in an XML file, except
for x9, xA and xD.
So if I change test.xml to the following (call it test1.0.xml), adding &#x8;

<?xml version="1.0" encoding="utf-8"?>

<foo>
  <bar>first line of text</bar>
  <bar>second line of text</bar>
  <bar>third line of text</bar>
  <bar>&#x0061;&#x0062;&#x0063;&#x8;</bar>  
</foo>

then Jing rightly complains that the file is not valid XML 1.0, because
of the illegal &#x8; character.

However, &#x8; _is_ valid in XML 1.1, so the following file (call it test1.1.xml)

<?xml version="1.1" encoding="utf-8"?>

<!-- N.B. change in line above to version="1.1" -->

<foo>
  <bar>first line of text</bar>
  <bar>second line of text</bar>
  <bar>third line of text</bar>
  <bar>&#x0061;&#x0062;&#x0063;&#x8;</bar>  
</foo>

is (correctly) accepted by Jing as valid XML 1.1.
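
To make the difference between the two versions concrete, here is a minimal sketch of the character ranges as I read them in the XML 1.0 and XML 1.1 Recommendations. The function name is_xml_char is made up for this illustration and is not part of pulldom or any other library. Note that in XML 1.1 a character such as x8 is only allowed as a character reference (&#x8;), not as a literal byte in the file; the sketch does not distinguish those two cases.

```python
def is_xml_char(codepoint, version="1.0"):
    """Return True if the code point is a legal character (possibly only as a
    character reference) for the given XML version.

    Hypothetical helper for illustration; ranges follow the Char production
    of the XML 1.0 and XML 1.1 Recommendations.
    """
    if version == "1.0":
        return (codepoint in (0x9, 0xA, 0xD)
                or 0x20 <= codepoint <= 0xD7FF
                or 0xE000 <= codepoint <= 0xFFFD
                or 0x10000 <= codepoint <= 0x10FFFF)
    if version == "1.1":
        # Everything from x1 upward is allowed, apart from the surrogate
        # block and xFFFE/xFFFF; x0 stays forbidden.
        return (0x1 <= codepoint <= 0xD7FF
                or 0xE000 <= codepoint <= 0xFFFD
                or 0x10000 <= codepoint <= 0x10FFFF)
    raise ValueError("unknown XML version: %r" % (version,))


print(is_xml_char(0x8, "1.0"))   # the character from test1.0.xml
print(is_xml_char(0x8, "1.1"))   # the character from test1.1.xml
```

This only restates the rule; it does not make any parser accept the file.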

************************

Problem:  pulldom handles test.xml (which lacks the offending &#x8;) but
   chokes on both test1.0.xml (which contains an invalid &#x8;) and
   test1.1.xml (which contains a valid &#x8;).

   It should fail for test1.0.xml and succeed for test1.1.xml (just like
   Jing does).

Here's a little test script (call it test.py) using pulldom to print the
text in each <bar> element:

#!/usr/bin/env python

import sys
from xml.dom import pulldom

infile = sys.argv[1]

events = pulldom.parse(infile)

def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc += node.data
    return rc

for (event, node) in events:
    if event == pulldom.START_ELEMENT and node.tagName == "bar":
        events.expandNode(node)
        print getText(node.childNodes)

# end of script

Invoking from the command line

  test.py test.xml

succeeds and outputs

  first line of text
  second line of text
  third line of text
  abc

But invoking

   test.py test1.0.xml
or
   test.py test1.1.xml

fails and gives the following traceback:

Traceback (most recent call last):
  File "test.py", line 17, in ?
    for (event, node) in events:
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 232, in next
    rc = self.getEvent()
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    self.parser.feed(buf)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:7:31: reference to invalid character number

# end of Traceback

Again, this behavior (raising an exception for the "invalid character
number" &#x8;) is appropriate for the XML 1.0 file but not for the XML 1.1
file.
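
For anyone who wants to reproduce this without creating the files, the following is a reduced, self-contained sketch of the same test. It feeds two in-memory documents to pulldom.parseString, differing only in the declared version. It is written for a current Python rather than the 2.3.3 install above, and the names DOC and try_parse are made up for the illustration. My assumption is that the underlying parser is expat, as in the traceback.

```python
from xml.dom import pulldom
from xml.sax import SAXParseException

# Same content as the <bar> line in test1.0.xml / test1.1.xml; only the
# declared version changes between the two runs.
DOC = '<?xml version="%s"?>\n<foo><bar>abc&#x8;</bar></foo>'


def try_parse(version):
    """Return None if pulldom parses the document, else the error message."""
    try:
        for event, node in pulldom.parseString(DOC % version):
            pass  # just drive the parser to the end of the document
    except SAXParseException as exc:
        return exc.getMessage()
    return None


results = {}
for version in ("1.0", "1.1"):
    results[version] = try_parse(version)
    print("%s: %s" % (version, results[version] or "parsed OK"))
```

As far as I can tell, the version="1.1" declaration itself is accepted; it is only the &#x8; reference that is rejected, in both documents.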

******************

I have an application that needs XML 1.1, including characters like &#x8;

How can I parse such files in Python? Preferably with pulldom, but I'm open
to all suggestions.

Thanks,

Ken