[XML-SIG] stripping 8-bit ASCII from an XML stream using encode?
Kevin Altis
altis@semi-retired.com
Fri, 4 Jan 2002 22:20:37 -0800
The code at the end of this message demonstrates a problem I'm having parsing XML files
downloaded from SourceForge. The specific issue is that there are characters
in the XML stream above ASCII 127; decimal 233 and 246 in the example file.
The cleanText function is supposed to strip the problem characters. Some of
the files contain non-printing ASCII characters below decimal 32, so I'm
trying to strip those manually after the XML is converted. Once the file is
parsed I'm using the fields in a GUI interface to the SourceForge tracker
database.
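As a sketch of the manual stripping step described above, assuming the text is already a decoded string and that tab, line feed and carriage return should be kept: the names CONTROL_CHARS and strip_control_chars are made up for illustration, and the snippet is written for Python 3 rather than the interpreter used elsewhere in this message.

```python
# Code points below 32, except tab (9), line feed (10) and carriage return (13).
# Mapping a code point to None tells str.translate to delete it.
CONTROL_CHARS = {cp: None for cp in range(32) if cp not in (9, 10, 13)}


def strip_control_chars(text):
    """Return text with non-printing control characters removed."""
    return text.translate(CONTROL_CHARS)
```

This removes every control character in one pass instead of one replace() call per character.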
Here is the URL for the XML version of the Python Feature Requests:
http://sourceforge.net/export/sf_tracker_export.php?atid=355470&group_id=5470
I thought that encode could be used to strip these characters, but I
sometimes get the following traceback.
t = t.encode('ascii', 'ignore')
UnicodeError: ASCII decoding error: ordinal not in range(128)
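One possible reading of that traceback, offered as a sketch rather than a confirmed diagnosis: calling encode on a byte string makes Python 2 first decode it implicitly with the default ASCII codec, and the 'ignore' argument does not apply to that hidden decode. That would explain why the error says "decoding" and only appears when the data contains bytes above 127. The snippet below shows the decode step explicitly; it is written for Python 3, and to_ascii_text and sample are hypothetical names.

```python
def to_ascii_text(raw):
    """Decode bytes as ASCII, silently dropping any byte above 127."""
    return raw.decode("ascii", "ignore")


# Sample data containing the two byte values mentioned above:
# 0xE9 is decimal 233 and 0xF6 is decimal 246.
sample = b"caf\xe9 and K\xf6ln"

# A strict ASCII decode fails on those bytes; this mirrors the implicit
# decode assumed to be behind the traceback.
try:
    sample.decode("ascii")
    strict_failed = False
except UnicodeError:
    strict_failed = True
```

Passing 'ignore' to the decode call, rather than to encode, is what drops the 8-bit values here.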
I haven't done much XML processing, so this could be a FAQ, but I haven't
been able to find the answer so far. What is the proper way to strip the
8-bit values? Is there another issue at work here?
Thanks,
ka
---
Kevin Altis
altis@semi-retired.com
---
example code by Mark Pilgrim:

def cleanText(t, collapseWhitespace=0):
    t = t.encode('ascii', 'ignore')
    t = t.replace(chr(19), '')
    if collapseWhitespace:
        t = t.replace('\t', '').replace('\n', '')
    return t

def getText(node, collapseWhitespace=0):
    return cleanText("".join([c.data for c in node.childNodes
                              if c.nodeType == c.TEXT_NODE]),
                     collapseWhitespace)

def doParse(xml):
    from xml.dom import minidom
    xml = cleanText(xml)
    xmldoc = minidom.parseString(xml)
    artifacts = xmldoc.getElementsByTagName('artifact')
    trackerDict = {}
    for a in artifacts:
        trackerDict[a.attributes["id"].value] = \
            {"summary": getText(a.getElementsByTagName("summary")[0],
                                collapseWhitespace=1),
             "detail": getText(a.getElementsByTagName('detail')[0])}
    return trackerDict

if __name__ == '__main__':
    # KEA
    # added code to download URL and save the file
    # comment out once you have the XML
    # the XML file is approximately 174K
    import urllib
    url = ('http://sourceforge.net/export/sf_tracker_export.php'
           '?atid=355470&group_id=5470')
    fp = urllib.urlopen(url)
    xml = fp.read()
    fp.close()
    filename = r'Python_FeatureRequests.xml'
    op = open(filename, 'wb')
    op.write(xml)
    op.close()

    fsock = open(filename)
    xml = fsock.read()
    fsock.close()
    trackerDict = doParse(xml)
    import pprint
    pprint.pprint(trackerDict)
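
To try the parse without downloading the 174K file, a reduced case along these lines might help. It is only a sketch: the SAMPLE data is invented (only the element names artifact, summary and detail and the id attribute come from the code above), the helper names are hypothetical, and it is written for Python 3, where the raw data is a bytes object and can be filtered before minidom sees it.

```python
from xml.dom import minidom

# Invented sample containing bytes 233 (0xE9), 239 (0xEF) and control byte 19.
SAMPLE = (
    b'<artifacts>'
    b'<artifact id="1"><summary>caf\xe9 menu</summary>'
    b'<detail>na\xefve \x13text</detail></artifact>'
    b'</artifacts>'
)


def clean_bytes(raw):
    """Keep tab, LF, CR and printable ASCII (32..126); drop everything else."""
    return bytes(b for b in raw if b in (9, 10, 13) or 32 <= b < 127)


def get_text(node):
    """Concatenate the text nodes directly under node."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def parse_sample(raw):
    """Build a dict keyed by artifact id from the cleaned XML."""
    doc = minidom.parseString(clean_bytes(raw))
    result = {}
    for a in doc.getElementsByTagName("artifact"):
        result[a.getAttribute("id")] = {
            "summary": get_text(a.getElementsByTagName("summary")[0]),
            "detail": get_text(a.getElementsByTagName("detail")[0]),
        }
    return result
```

Filtering before the parse means the stripped characters are lost for good, so this only makes sense if the accented characters are not needed in the GUI.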