suppressing bad characters in output PCDATA (converting JSON to XML)
Adam Funk
a24061 at ducksburg.com
Tue Nov 29 07:57:22 EST 2011
On 2011-11-28, Stefan Behnel wrote:
> Adam Funk, 25.11.2011 14:50:
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules. I get the input this way:
>>
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
>
> It doesn't make sense to use codecs.open() with a "b" mode.
OK, thanks.
>> big_json = json.load(input_source)
>
> You shouldn't decode the input before passing it into json.load(), just
> open the file in binary mode. Serialised JSON is defined as being UTF-8
> encoded (or BOM-prefixed), not decoded Unicode.
So just do
input_source = open(input_file, 'rb')
big_json = json.load(input_source)
?
>> input_source.close()
>
> In case of a failure, the file will not be closed safely. All in all, use
> this instead:
>
> with open(input_file, 'rb') as f:
>     big_json = json.load(f)
OK, thanks.
>> Then I recurse through the contents of big_json to build an instance
>> of xml.dom.minidom.Document (the recursion includes some code to
>> rewrite dict keys as valid element names if necessary)
>
> If the name "big_json" is supposed to hint at a large set of data, you may
> want to use something other than minidom. Take a look at the
> xml.etree.cElementTree module instead, which is substantially more memory
> efficient.
Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file. The output files range from 600 to 6000 bytes.
>> and I save the document:
>>
>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>> doc.writexml(xml_file, encoding='UTF-8')
>> xml_file.close()
>
> Same mistakes as above. Especially the double encoding is both unnecessary
> and likely to fail. This is also most likely the source of your problems.
Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).
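If I follow your advice, I suppose the save step collapses to a single
encoding pass. Untested sketch (a throwaway document stands in for mine,
and the temp path stands in for my real output_fullpath):

```python
import os
import tempfile
from xml.dom.minidom import Document

# Throwaway document standing in for the one built from the JSON.
doc = Document()
root = doc.createElement('note')
root.appendChild(doc.createTextNode(u'hello'))
doc.appendChild(root)

output_fullpath = os.path.join(tempfile.mkdtemp(), 'note.xml')

# Encode exactly once: toxml(encoding=...) returns UTF-8 bytes,
# so the file is opened in binary mode -- no codecs.open at all.
with open(output_fullpath, 'wb') as xml_file:
    xml_file.write(doc.toxml(encoding='UTF-8'))
```

That way minidom does the only encoding, and nothing re-encodes the
bytes on the way to disk.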
>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>>
>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> This strongly hints at a broken encoding, which can easily be triggered by
> your erroneous encode-and-encode cycles above.
No, I've checked the JSON input and those exact control characters are
there too. I want to suppress them (delete or replace with spaces).
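Something like this regex scrub (a sketch, applied to each unicode
string before building the DOM) would do what I'm after:

```python
import re

# Control characters forbidden by the XML 1.0 Char production:
# everything below U+0020 except tab, newline and carriage return.
XML_BAD_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def scrub_pcdata(text, replacement=u' '):
    """Replace XML-invalid control characters with `replacement`."""
    return XML_BAD_CHARS.sub(replacement, text)
```

So scrub_pcdata(u'bell\x07here') gives u'bell here', character 31 is
likewise caught, and tabs and newlines pass through untouched.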
> Also, the kind of problem you present here makes it pretty clear that you
> are using Python 2.x. In Python 3, you'd get the appropriate exceptions
> when trying to write binary data to a Unicode file.
Sorry, I forgot to mention the version I'm using, which is "2.7.2+".
--
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the
myth of the infallible electronic brain. (Stob 2001)