[XML-SIG] Problems with "ignorable whitespace" in python's minidom and pulldom !

Arno Wilhelm quirxi at aon.at
Thu Mar 11 03:40:14 EST 2004


Hello,

I hope this is the right mailing list for this kind of topic. If not, do not 
hesitate to ignore this posting or direct me to another mailing list.

Here is the problem:
My application is a web-server-centered program that uses mod_python and has to 
process XML files. Most of the time these XML files contain ignorable whitespace 
such as \n, \r and \t between the tags. The problem is that minidom seems to 
interpret this whitespace as text nodes, and I cannot know in advance how many 
of these "text nodes" sit between the real data nodes. This seems to disturb the 
structure of the DOM tree that I expect: the firstChild of an element, for 
example, is no longer the first child element but a whitespace text node. That 
makes it hard to write a reliable XML application, since I cannot know how much 
whitespace the writer/editor of the XML file has put between the tags. So I 
tried to get rid of these unwanted text nodes with this piece of code, but that 
did not help either:


################################################################################
#
################################################################################
def cleanUpNodes( nodes ):
	"""Removes all TEXT_NODES in parameter nodes that contain only characters
	that are defined as whitespace in the string library"""

	for node in nodes.childNodes:
		if node.nodeType == Node.TEXT_NODE:
			node.data = string.strip(node.data)
	nodes.normalize()

################################################################################
#
################################################################################
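
The function above only looks at the direct children of the node it is given, 
so whitespace nested deeper in the tree is left untouched. A recursive variant 
might look like the following sketch. It is written in current Python 3 syntax, 
the name strip_whitespace_nodes and the SAMPLE document are hypothetical (not 
part of minidom or of my application), and it assumes that whitespace-only text 
is never significant in these files:

```python
from xml.dom import minidom, Node


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace.

    Hypothetical helper, not part of minidom.
    """
    # Iterate over a copy because removeChild() mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        else:
            strip_whitespace_nodes(child)


# Hypothetical miniature of the document shown later in this post.
SAMPLE = """<root>
    <child_1>
        <child_11/>
    </child_1>
    <child_2 type="admin"/>
</root>"""

dom = minidom.parseString(SAMPLE)
strip_whitespace_nodes(dom.documentElement)

names = [child.nodeName for child in dom.documentElement.childNodes]
print(names)  # ['child_1', 'child_2']
print(dom.documentElement.firstChild.firstChild.nodeName)  # child_11
```

Removing the nodes outright, instead of emptying them and relying on 
normalize(), avoids depending on how a particular minidom version treats empty 
text nodes.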


I also tried pulldom, but it reports the whitespace as "CHARACTERS" events and 
not as "IGNORABLE_WHITESPACE" events. Another thing is that pulldom never seems 
to generate an "END_DOCUMENT" event?!
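
As far as I understand, a parser can only classify whitespace as ignorable when 
a DTD declares which elements have element-only content, and my files have no 
DTD. A possible workaround on the pulldom side is to filter the events by hand. 
The sketch below is in current Python 3 syntax; significant_events and SAMPLE 
are hypothetical names, and it again assumes that whitespace-only text never 
carries meaning:

```python
from xml.dom import pulldom

# Hypothetical sample document, not one of the real files.
SAMPLE = "<root>\n\t<child_1/>\n</root>"


def significant_events(xml_text):
    """Yield (event, node) pairs, skipping whitespace-only CHARACTERS events."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.CHARACTERS and not node.data.strip():
            continue
        yield event, node


# Event types as delivered by pulldom, before and after filtering.
raw_kinds = [event for event, _ in pulldom.parseString(SAMPLE)]
kinds = [event for event, _ in significant_events(SAMPLE)]
print(raw_kinds)
print(kinds)
```

The filter compares against the pulldom.CHARACTERS constant rather than a 
literal string, so it keeps working if the event names are ever changed.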

The big question is:
Does anybody know a way around this problem?
Am I missing something?
How can I get rid of these unwanted whitespace text nodes?

Here is an example that shows what the same code interprets as child nodes when 
processing the same XML file with and without whitespace between the tags:


<############### XML File with white spaces #################>
<root>

	<child_1>
		<child_11>
			<child_111 path="/qpers_data/" proto="file" />
		</child_11>
	</child_1>

	<child_2 type="admin" status="active" label="root">
		<child_21 path="/qnodes/admin/admin_root.xml" proto="file" />
	</child_2>

</root>

<############################# Code #############################>

#!/usr/bin/python

from xml.dom import minidom
from xml.dom import Node
import string

################################################################################
def cleanUpNodes( nodes ):
	"""Removes all TEXT_NODES in parameter nodes that contain only characters
	that are defined as whitespace in the string library"""
	for node in nodes.childNodes:
		if node.nodeType == Node.TEXT_NODE:
			node.data = string.strip(node.data)
	nodes.normalize()

###############################################################################
def dumpTree( xmlFileIn, xmlFileOut ):
	
	try:
		dom = minidom.parse( xmlFileIn )
		file = open( xmlFileOut, "w" )
	except IOError, (errno, strerror):
		print "I/O error(%s): %s" % (errno, strerror )
		return
	
	cleanUpNodes( dom.documentElement )
	for node in dom.documentElement.childNodes:
		
		while ( node ):
			file.write( "\n node ->" + node.nodeName )
			file.write( node.toxml('ISO-8859-1') )
			node = node.firstChild

	file.close()
		
	return 1

###############################################################################
dumpTree( "index_wos.xml", "without_space.xml" )




<####################### Output with XML with whitespace ####################>

  node ->child_1<child_1>
		<child_11>
			<child_111 path="/qpers_data/" proto="file"/>
		</child_11>
	</child_1>
  node ->#text
		
  node ->child_2<child_2 label="root" status="active" type="admin">
		<child_21 path="/qnodes/admin/admin_root.xml" proto="file"/>
	</child_2>
  node ->#text


<#################### Output with XML without whitespace ####################>

  node ->child_1<child_1><child_11><child_111 path="/qpers_data/" proto="file"/></child_11></child_1>
  node ->child_11<child_11><child_111 path="/qpers_data/" proto="file"/></child_11>
  node ->child_111<child_111 path="/qpers_data/" proto="file"/>
  node ->child_2<child_2 label="root" status="active" type="admin"><child_21 path="/qnodes/admin/admin_root.xml" proto="file"/></child_2>
  node ->child_21<child_21 path="/qnodes/admin/admin_root.xml" proto="file"/>



regards,


Arno Wilhelm


