suppressing bad characters in output PCDATA (converting JSON to XML)

Adam Funk a24061 at ducksburg.com
Fri Nov 25 08:50:01 EST 2011


I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules.  I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
big_json = json.load(input_source)
input_source.close()

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary), and I save the
document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()
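
For reference, the recursion is roughly along these lines (heavily
simplified; fix_name is just a crude stand-in for the real
key-rewriting code):

import re
import xml.dom.minidom

def fix_name(key):
    # Crude stand-in for the real key-rewriting code: coerce the key
    # into something that can serve as an element name.
    name = re.sub(r'[^A-Za-z0-9_.-]', '_', key)
    return name if name and not name[0].isdigit() else 'k_' + name

def json_to_dom(doc, parent, name, j):
    elem = doc.createElement(fix_name(name))
    parent.appendChild(elem)
    if isinstance(j, dict):
        for key, value in j.items():
            json_to_dom(doc, elem, key, value)
    elif isinstance(j, list):
        for item in j:
            json_to_dom(doc, elem, 'item', item)
    else:
        # strings, numbers, booleans and None all become text nodes
        elem.appendChild(doc.createTextNode(unicode(j)))

called as json_to_dom(doc, doc, 'root', big_json) once
doc = xml.dom.minidom.Document() exists.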


I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31

I guess I need to process each piece of PCDATA myself to clean out the
control characters before creating the text node --- encoding as UTF-8
only guarantees well-formed byte sequences, and characters like 7 and
31 encode fine but simply aren't allowed in XML 1.0:

  text = doc.createTextNode(j)
  root.appendChild(text)

What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings?  I found some suggestions on the
WWW involving filter with string.printable, which AFAICT isn't
unicode-friendly --- is there a unicode.printable or something like
that?
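
A regex substitution along these lines might do it (sketched from the
XML 1.0 Char production, not yet tested against the real data), but I
don't know whether there's something cleaner already in the standard
library:

import re

# C0 controls other than tab, newline and carriage return, plus
# U+FFFE and U+FFFF, are not allowed in XML 1.0 at all, even escaped.
# (Unpaired surrogates are also invalid, but json.load shouldn't
# normally hand those back.)
_xml_invalid = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')

def clean_pcdata(s):
    # Drop the offending characters; substitute u'\ufffd' instead if a
    # visible placeholder is preferred.
    return _xml_invalid.sub(u'', s)

so that the text nodes would become:

  text = doc.createTextNode(clean_pcdata(j))
  root.appendChild(text)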


-- 
"Mrs CJ and I avoid clichés like the plague."


