[XML-SIG] Problems with "ignorable whitespace" in Python's minidom and pulldom
Arno Wilhelm
quirxi at aon.at
Thu Mar 11 03:40:14 EST 2004
Hello,
I hope this is the right mailing list for this kind of topic. If not, do not
hesitate to ignore this posting or direct me to another mailing list.
Here is the problem:
My application is a web-server-centered program that uses mod_python and has
to process XML files. Most of the time these files contain ignorable whitespace
such as \n, \r and \t between the tags. The problem is that minidom seems to
interpret this whitespace as text nodes, and I cannot know in advance how many
of these "text nodes" sit between the real data nodes. This seems to disturb
the real structure of the DOM tree: the child nodes are no longer found where I
expect them, etc. That makes it hard to write a reliable XML application, since
I cannot know how much whitespace the writer/editor of the XML file has put
between the tags. So I tried to get rid of these unwanted text nodes with the
following piece of code, but that did not help either:
################################################################################
#
################################################################################
def cleanUpNodes( nodes ):
    """Removes all TEXT_NODES in parameter nodes that contain only characters
    that are defined as whitespace in the string library"""
    for node in nodes.childNodes:
        if node.nodeType == Node.TEXT_NODE:
            node.data = string.strip(node.data)
    nodes.normalize()
################################################################################
#
################################################################################
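For comparison, a variant that removes the whitespace-only text nodes outright (instead of emptying them) and that also descends into child elements might look like the sketch below. It is only a sketch under assumptions: it is written for Python 3 rather than the Python 2 used in this post, the name strip_whitespace_nodes is hypothetical, and it assumes that whitespace-only text is never significant in the documents being processed.

```python
from xml.dom import minidom, Node

def strip_whitespace_nodes(parent):
    """Recursively remove text nodes that contain only whitespace.

    Hypothetical helper, not part of minidom.
    """
    # Iterate over a copy, because removeChild() mutates parent.childNodes.
    for child in list(parent.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            parent.removeChild(child)
            child.unlink()
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)

doc = minidom.parseString(
    "<root>\n  <child_1>\n    <child_11/>\n  </child_1>\n  <child_2/>\n</root>"
)
strip_whitespace_nodes(doc.documentElement)

names = [n.nodeName for n in doc.documentElement.childNodes]
nested = doc.documentElement.firstChild.firstChild.nodeName
print(names)   # ['child_1', 'child_2']
print(nested)  # child_11
```

Note that this would also drop whitespace-only text inside mixed content, so it is only appropriate for data-oriented files like the example in this post.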
I also tried pulldom, but it reports the whitespace as "CHARACTERS" events and
not as "IGNORABLE_WHITESPACE" events. Another thing is that pulldom never seems
to generate an "END_DOCUMENT" event?!
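The pulldom behaviour described above can be checked with a small sketch like the following. It assumes Python 3 and the default (non-validating) parser without a DTD. The variable names are illustrative only.

```python
from xml.dom import pulldom

xml_text = "<root>\n  <child_1/>\n</root>"

kinds = []                # every event kind, in order of arrival
whitespace_kinds = set()  # event kinds that carried whitespace-only text

for event, node in pulldom.parseString(xml_text):
    kinds.append(event)
    if event in (pulldom.CHARACTERS, pulldom.IGNORABLE_WHITESPACE):
        if not node.data.strip():
            whitespace_kinds.add(event)

print(kinds)
print(whitespace_kinds)
```

Under these assumptions the whitespace between the tags only ever shows up in CHARACTERS events, so a consumer has to test the text itself (for example with strip()) to decide whether to skip it.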
The big question is:
Does anybody know a way around this problem ?
Am I missing something ?
How can I get rid of these unwanted whitespace text nodes ?
Here is an example that shows what the same code interprets as child nodes when
processing the same XML file with and without whitespace between the tags:
<############### XML File with white spaces #################>
<root>
    <child_1>
        <child_11>
            <child_111 path="/qpers_data/" proto="file" />
        </child_11>
    </child_1>
    <child_2 type="admin" status="active" label="root">
        <child_21 path="/qnodes/admin/admin_root.xml" proto="file" />
    </child_2>
</root>
<############################# Code #############################>
#!/usr/bin/python
from xml.dom import minidom
from xml.dom import Node
import string
################################################################################
def cleanUpNodes( nodes ):
    """Removes all TEXT_NODES in parameter nodes that contain only characters
    that are defined as whitespace in the string library"""
    for node in nodes.childNodes:
        if node.nodeType == Node.TEXT_NODE:
            node.data = string.strip(node.data)
    nodes.normalize()
###############################################################################
def dumpTree( xmlFileIn, xmlFileOut ):
    try:
        dom = minidom.parse( xmlFileIn )
        file = open( xmlFileOut, "w" )
    except IOError, (errno, strerror):
        print "I/O error(%s): %s" % (errno, strerror )
        return
    cleanUpNodes( dom.documentElement )
    for node in dom.documentElement.childNodes:
        # follow the chain of first children, dumping each node on the way
        while ( node ):
            file.write( "\n node ->" + node.nodeName )
            file.write( node.toxml('ISO-8859-1') )
            node = node.firstChild
    file.close()
    return 1
###############################################################################
dumpTree( "index_wos.xml", "without_space.xml" )
<####################### Output with XML with whitespace ####################>
node ->child_1<child_1>
<child_11>
<child_111 path="/qpers_data/" proto="file"/>
</child_11>
</child_1>
node ->#text
node ->child_2<child_2 label="root" status="active" type="admin">
<child_21 path="/qnodes/admin/admin_root.xml" proto="file"/>
</child_2>
node ->#text
<#################### Output with XML without whitespace ####################>
node ->child_1<child_1><child_11><child_111 path="/qpers_data/" proto="file"/></child_11></child_1>
node ->child_11<child_11><child_111 path="/qpers_data/" proto="file"/></child_11>
node ->child_111<child_111 path="/qpers_data/" proto="file"/>
node ->child_2<child_2 label="root" status="active" type="admin"><child_21 path="/qnodes/admin/admin_root.xml" proto="file"/></child_2>
node ->child_21<child_21 path="/qnodes/admin/admin_root.xml" proto="file"/>
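The difference between the two outputs comes from following firstChild: as soon as the first child is a whitespace text node, the walk stops, because a text node has no children. A minimal sketch that reproduces this, assuming Python 3 and a hypothetical helper named first_child_chain:

```python
from xml.dom import minidom

def first_child_chain(xml_text):
    """Return the node names met when following firstChild from the root."""
    node = minidom.parseString(xml_text).documentElement
    names = []
    while node is not None:
        names.append(node.nodeName)
        node = node.firstChild
    return names

with_ws = "<root>\n<child_1>\n<child_11/>\n</child_1>\n</root>"
without_ws = "<root><child_1><child_11/></child_1></root>"

print(first_child_chain(with_ws))     # ['root', '#text']
print(first_child_chain(without_ws))  # ['root', 'child_1', 'child_11']
```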
regards,
Arno Wilhelm