suppressing bad characters in output PCDATA (converting JSON to XML)

Tue Nov 29 09:33:57 EST 2011

Adam Funk, 29.11.2011 13:57:
> On 2011-11-28, Stefan Behnel wrote:
>> Adam Funk, 25.11.2011 14:50:
>>> Then I recurse through the contents of big_json to build an instance
>>> of xml.dom.minidom.Document (the recursion includes some code to
>>> rewrite dict keys as valid element names if necessary)
>>
>> If the name "big_json" is supposed to hint at a large set of data, you may
>> want to use something other than minidom. Take a look at the
>> xml.etree.cElementTree module instead, which is substantially more memory
>> efficient.
>
> Well, the input file in this case contains one big JSON list of
> reasonably sized elements, each of which I'm turning into a separate
> XML file.  The output files range from 600 to 6000 bytes.

It's also substantially easier to use, but if your XML writing code works 
already, why change it.

>>> and I save the document:
>>>
>>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>>> doc.writexml(xml_file, encoding='UTF-8')
>>> xml_file.close()
>>
>> Same mistakes as above. Especially the double encoding is both unnecessary
>> and likely to fail. This is also most likely the source of your problems.
>
> Well actually, I had the problem with the occasional control
> characters in the output *before* I started sticking encoding="UTF-8"
> all over the place (in an unsuccessful attempt to beat them down).

You should read up on Unicode a bit.

>>> I thought this would force all the output to be valid, but xmlstarlet
>>> gives some errors like these on a few documents:
>>>
>>> PCDATA invalid Char value 7
>>> PCDATA invalid Char value 31
>>
>> This strongly hints at a broken encoding, which can easily be triggered by
>> your erroneous encode-and-encode cycles above.
>
> No, I've checked the JSON input and those exact control characters are
> there too.

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a 
warning.

> I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML from 
it. If replacing is enough, take a look at string.maketrans() in the string 
module and str.translate(), a method on strings. Or maybe just use a 
regular expression that matches any whitespace character and replace it 
with a space. Or whatever suits your data best.

>> Also, the kind of problem you present here makes it pretty clear that you
>> are using Python 2.x. In Python 3, you'd get the appropriate exceptions
>> when trying to write binary data to a Unicode file.
>
> Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

Yep, Py2 makes Unicode handling harder than it should be.

Stefan