[XML-SIG] pulldom with XML 1.1 problem
Ken Beesley
ken.beesley at xrce.xerox.com
Sat Aug 28 14:44:21 CEST 2004
Newbie problem: pulldom with XML 1.1
The Question:
How can I make pulldom parse according to XML 1.1 conventions?
Or: Is there an upgrade of pulldom that handles XML 1.1?
Or: Is there some other XML 1.1 parsing solution in Python?
Background: I'm running
Python 2.3.3 (#1, Feb 17 2004, 11:48:35)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Illustration of my problem:
I start with the following simple xml file, call it test.xml
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>first line of text</bar>
<bar>second line of text</bar>
<bar>third line of text</bar>
<bar>abc</bar>
</foo>
and the following Relax NG schema (compact syntax), call it test.rng
grammar {
start = element foo {
element bar {text}+
}
}
Validation of test.xml succeeds using the Jing validating parser:
java -jar jing.jar -c test.rng test.xml
So far so good.
****** Now for XML 1.0 vs. XML 1.1 ...
In XML 1.0, all characters below x20 are invalid in an XML file, except
for x9, xA and xD.
So if I change test.xml to the following (call it test1.0.xml), adding a
character reference to one of those invalid control characters
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>first line of text</bar>
<bar>second line of text</bar>
<bar>third line of text</bar>
<bar>abc</bar> <!-- N.B. addition of  -->
</foo>
then Jing rightly complains that the file is not XML 1.0 valid, because
of the illegal control character.
However, a reference to that character _is_ valid in XML 1.1, so the
following file (call it test1.1.xml)
<?xml version="1.1" encoding="utf-8"?>
<!-- N.B. change in line above to version="1.1" -->
<foo>
<bar>first line of text</bar>
<bar>second line of text</bar>
<bar>third line of text</bar>
<bar>abc</bar> <!-- N.B. addition of  -->
</foo>
is (correctly) accepted by Jing as valid XML 1.1.
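To make the difference concrete, here is a minimal sketch of the character rules as I read them in the XML 1.0 and XML 1.1 recommendations. The function names are mine, chosen only for illustration, and the code uses nothing beyond plain integer comparisons. Note that in XML 1.1 the "restricted" control characters are legal only when written as character references, and x0 is never legal.

```python
def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_char(cp):
    """True if code point cp matches the XML 1.1 Char production."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_restricted(cp):
    """True if cp is an XML 1.1 RestrictedChar, i.e. a character that
    may appear only as a character reference, never literally."""
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


if __name__ == "__main__":
    for cp in (0x0, 0x1, 0x9, 0x1F, 0x20):
        print("x%X: 1.0=%s 1.1=%s restricted=%s" % (
            cp, is_xml10_char(cp), is_xml11_char(cp),
            is_xml11_restricted(cp)))
```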
************************
Problem: pulldom handles test.xml (which lacks the offending character
reference) but chokes on both test1.0.xml (where the reference is
invalid) and test1.1.xml (where it is valid).
It should fail for test1.0.xml and succeed for test1.1.xml (just like
Jing does).
Here's a little test script (call it test.py) using pulldom to print the
text in each <bar> element:
#!/usr/bin/env python

import sys
from xml.dom import pulldom

infile = sys.argv[1]

events = pulldom.parse(infile)

def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc += node.data
    return rc

for (event, node) in events:
    if event == pulldom.START_ELEMENT and node.tagName == "bar":
        events.expandNode(node)
        print(getText(node.childNodes))
# end of script
Invoking from the command line
test.py test.xml
succeeds and outputs
first line of text
second line of text
third line of text
abc
But invoking
test.py test1.0.xml
or
test.py test1.1.xml
fails and gives the following traceback:
Traceback (most recent call last):
  File "test.py", line 17, in ?
    for (event, node) in events:
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 232, in next
    rc = self.getEvent()
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    self.parser.feed(buf)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:7:31: reference to invalid character number
# end of Traceback
Again, this behavior, raising an exception for an "invalid character
number", is appropriate for the XML 1.0 file but not for the XML 1.1 file.
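The comparison can also be reproduced without any files on disk. The following is only a sketch under stated assumptions: "&#x1;" stands in for the control-character reference in my test files, parses_ok is a helper name made up for this illustration, and the syntax is written so that it also runs on newer interpreters than the 2.3.3 I am using. It feeds the same body to pulldom.parseString under both version declarations and reports whether parsing raised a SAXParseException.

```python
from xml.dom import pulldom
from xml.sax import SAXParseException

# Assumption: "&#x1;" stands in for the control-character reference.
BODY = "<foo><bar>abc&#x1;</bar></foo>"
DOC_10 = '<?xml version="1.0"?>\n' + BODY
DOC_11 = '<?xml version="1.1"?>\n' + BODY


def parses_ok(document):
    """Return True if pulldom can walk the whole document without
    the underlying parser raising a SAXParseException."""
    try:
        for event, node in pulldom.parseString(document):
            pass  # we only care whether parsing completes
    except SAXParseException:
        return False
    return True


if __name__ == "__main__":
    for label, doc in (("1.0", DOC_10), ("1.1", DOC_11)):
        print("version %s accepted: %s" % (label, parses_ok(doc)))
```

With the expat-based reader shown in the traceback, I would expect both documents to be reported as not accepted, which is exactly the behavior I am asking about.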
******************
I have an application that needs XML 1.1, including references to such
control characters.
How can I parse such files in Python? I'd prefer pulldom, but I'm open
to all suggestions.
Thanks,
Ken