suppressing bad characters in output PCDATA (converting JSON to XML)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Nov 28 07:11:15 EST 2011
On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:
> I'm converting JSON data to XML using the standard library's json and
> xml.dom.minidom modules. I get the input this way:
>
> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>                            errors='replace')
> big_json = json.load(input_source)
> input_source.close()
>
> Then I recurse through the contents of big_json to build an instance of
> xml.dom.minidom.Document (the recursion includes some code to rewrite
> dict keys as valid element names if necessary),
How are you doing that? What do you consider valid?
> and I save the document:
>
> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>                        errors='replace')
> doc.writexml(xml_file, encoding='UTF-8')
> xml_file.close()
>
>
> I thought this would force all the output to be valid, but xmlstarlet
> gives some errors like these on a few documents:
It will force the output to be valid UTF-8 once it is encoded to bytes,
but that does not make it valid XML.
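
To illustrate the difference, here is a minimal sketch, assuming Python 3
(where str is Unicode). It builds a tiny document whose text node contains
a BEL character (code point 7). Encoding to UTF-8 succeeds, but parsing the
result back fails, because that character is not allowed in XML 1.0. The
element name 'root' and the variable names are made up for the example.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Build a document with a control character in its text content.
doc = xml.dom.minidom.Document()
root = doc.createElement('root')
doc.appendChild(root)
root.appendChild(doc.createTextNode('bell: \x07'))

# minidom serializes the text as-is, and the UTF-8 encoder accepts it:
# U+0007 is a perfectly good character as far as UTF-8 is concerned.
serialized = doc.toxml()
encoded = serialized.encode('UTF-8')

# An XML parser, however, rejects the result as not well-formed.
try:
    xml.dom.minidom.parseString(encoded)
    parsed_ok = True
except ExpatError:
    parsed_ok = False

print(parsed_ok)
```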
> PCDATA invalid Char value 7
> PCDATA invalid Char value 31
What's xmlstarlet, and at what point does it give this error? It doesn't
appear to be in the standard library.
> I guess I need to process each piece of PCDATA to clean out the control
> characters before creating the text node:
>
> text = doc.createTextNode(j)
> root.appendChild(text)
>
> What's the best way to do that, bearing in mind that there can be
> multibyte characters in the strings?
Are you mixing unicode and byte strings?
Are you sure that the input source is actually UTF-8? If not, then all
bets are off: even if the decoding step works, and returns a string, it
may contain the wrong characters. This might explain why you are getting
unexpected control characters in the output: they've come from a badly
decoded input.
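
One way to test that, sketched here under the assumption that you can read
the raw bytes of the file, is to decode strictly instead of with
errors='replace'. The helper name is_valid_utf8 is made up for this
example. Note that a failure proves the data is not UTF-8, while a success
only shows that it could be.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8 without any errors."""
    try:
        data.decode('UTF-8', errors='strict')
    except UnicodeDecodeError:
        return False
    return True


# 'café' encoded as UTF-8 decodes cleanly.
print(is_valid_utf8(b'caf\xc3\xa9'))
# The same word encoded as Latin-1 is not valid UTF-8.
print(is_valid_utf8(b'caf\xe9'))
```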
Another possibility is that your data actually does contain control
characters where there shouldn't be any.
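
If that turns out to be the case, one possible approach (a sketch, not
necessarily the best way) is to drop every character outside the XML 1.0
Char production before creating the text node. This assumes Python 3 and
that the strings are already decoded to Unicode, so it works on code
points and multibyte characters are not a problem. The function name
strip_invalid_xml_chars is made up for this example.

```python
import re

# Anything NOT matching the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub('', text)


# Control characters 7 and 31 are removed; tab, newline and
# non-ASCII characters are kept.
cleaned = strip_invalid_xml_chars('a\x07b\x1fc\t\u00e9\n')
print(repr(cleaned))
```

You would then call it on each string before doc.createTextNode().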
--
Steven