[XML-SIG] losing entities when parsing then texting

Andrew Clover and-xml at doxdesk.com
Wed Jul 6 15:19:56 CEST 2005


Greg Wilson <gvwilson at cs.utoronto.ca> wrote:

> I realize I should include the Unicode characters directly in my files,
> but that's not possible in this case---I have to accommodate people who
> are using editors that only handle 7-bit ASCII.

Theoretically, .toxml('us-ascii') should generate usable output.
Unfortunately minidom doesn't do this properly: instead of escaping the
unencodable characters as character references, it raises a
UnicodeError.

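As a workaround you could just take the serialised output as a Unicode
string (decoding the UTF-8 encoded version first if that's what you
have) and call .encode('us-ascii', 'xmlcharrefreplace') on it... which
is technically the wrong thing if nodeNames or CDATASections or whatever
have non-ASCII characters in, because character references aren't
recognised there, but that probably doesn't matter to you.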
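
A minimal sketch of that workaround, assuming a present-day Python 3
minidom, where toxml() with no encoding argument returns a str (the
variable names and sample documents are only illustrative):

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Text content with a non-ASCII character: the workaround handles this.
doc = minidom.parseString('<p>caf\u00e9</p>'.encode('utf-8'))

# With no encoding argument, toxml() returns a text string (Python 3).
text = doc.toxml()

# Characters that us-ascii cannot represent become decimal character
# references such as &#233;.
ascii_bytes = text.encode('us-ascii', 'xmlcharrefreplace')

# The ASCII-only output parses back to the same text.
roundtrip = minidom.parseString(ascii_bytes)
assert roundtrip.documentElement.firstChild.data == 'caf\u00e9'

# The caveat: a non-ASCII element name gets escaped as well, and a
# character reference inside a name is not well-formed XML.
bad = minidom.parseString('<caf\u00e9/>'.encode('utf-8'))
bad_bytes = bad.toxml().encode('us-ascii', 'xmlcharrefreplace')
try:
    minidom.parseString(bad_bytes)
    name_survived = True
except ExpatError:
    name_survived = False
```

Here name_survived ends up False, which is the "technically the wrong
thing" case: the blanket replacement is only safe while the non-ASCII
characters are confined to text and attribute values.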

ObStandardPlug: pxdom supports both proper charref-escaping (using
DOM3LS DOMOutput.encoding) and keeping EntityReference nodes (using
DOM3Core DOMConfiguration.setParameter('entities', True) or
pxdom.parse(file, {'entities': True})).

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

