suppressing bad characters in output PCDATA (converting JSON to XML)
Adam Funk
a24061 at ducksburg.com
Tue Nov 29 07:50:59 EST 2011
On 2011-11-28, Steven D'Aprano wrote:
> On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:
>
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules. I get the input this way:
>>
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>>                            errors='replace')
>> big_json = json.load(input_source)
>> input_source.close()
>>
>> Then I recurse through the contents of big_json to build an instance of
>> xml.dom.minidom.Document (the recursion includes some code to rewrite
>> dict keys as valid element names if necessary),
>
> How are you doing that? What do you consider valid?
Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to
the beginning of any potential tag that doesn't start with a letter.
This is good enough for my purposes.
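Roughly like this (an untested sketch of what I described; sanitize_tag
is just an illustrative name, not what my code actually calls it):

```python
import re

def sanitize_tag(key):
    # Collapse any run of whitespace in the JSON key to a single '_',
    # then prefix 'a_' if the result doesn't start with a letter, so
    # the string can serve as an XML element name.
    tag = re.sub(r'\s+', '_', key)
    if not tag[:1].isalpha():
        tag = 'a_' + tag
    return tag
```

(Note this doesn't guard against every character that's invalid in an
element name, only the cases I've actually run into.)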
>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>
> It will force the output to be valid UTF-8 encoded to bytes, not
> necessarily valid XML.
Yes!
>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> What's xmlstarlet, and at what point does it give this error? It doesn't
> appear to be in the standard library.
It's a command-line tool I use a lot for finding the bad bits in XML;
nothing to do with Python.
http://xmlstar.sourceforge.net/
>> I guess I need to process each piece of PCDATA to clean out the control
>> characters before creating the text node:
>>
>> text = doc.createTextNode(j)
>> root.appendChild(text)
>>
>> What's the best way to do that, bearing in mind that there can be
>> multibyte characters in the strings?
>
> Are you mixing unicode and byte strings?
I don't think I am.
> Are you sure that the input source is actually UTF-8? If not, then all
> bets are off: even if the decoding step works, and returns a string, it
> may contain the wrong characters. This might explain why you are getting
> unexpected control characters in the output: they've come from a badly
> decoded input.
I'm pretty sure that the input is really UTF-8, but has a few control
characters (fairly rare).
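One way to confirm that hunch would be a strict decode that reports
where the first bad byte is, instead of silently replacing it (a quick
sketch; check_utf8 is just an illustrative name):

```python
def check_utf8(path):
    # Return None if the file decodes cleanly as UTF-8, otherwise the
    # byte offset of the first undecodable byte.
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        return None
    except UnicodeDecodeError as e:
        return e.start
```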
> Another possibility is that your data actually does contain control
> characters where there shouldn't be any.
I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.
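A compiled regex over the forbidden ranges seems like the obvious
approach; this is an untested sketch, with the ranges taken from the
XML 1.0 spec's "Char" production (tab, newline and carriage return are
the only legal controls):

```python
import re

# Characters the XML 1.0 "Char" production forbids: C0 controls other
# than \t \n \r, lone surrogates, and the two non-characters U+FFFE
# and U+FFFF.
_ILLEGAL_XML = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')

def clean_pcdata(s):
    # Delete every character that may not appear in XML PCDATA.
    return _ILLEGAL_XML.sub('', s)
```

Then feed the cleaned string to createTextNode instead of the raw one.
Since the regex operates on unicode strings (not encoded bytes),
multibyte characters aren't an issue.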
--
Some say the world will end in fire; some say in segfaults.
[XKCD 312]