[Python-Dev] Unicode entities in XML cause problems :-(

Martin v. Loewis martin@v.loewis.de
28 Apr 2002 11:16:44 +0200

"Matthias Urlichs" <smurf@noris.de> writes:

> > The proper fix, IMO, is to have writexml accept an encoding argument,
> > and, by default, write the output as UTF-8. Then there is no need for
> > character or entity references.
> > 
> The encoding should probably default to the one from the document header
> (UTF-8 if that isn't given).

In .toxml, we are going to *create* a document header. We can put
anything we want in there.

If you think that the encoding should be the one that the "original"
document had - that cannot work. First, the parser does not provide
that information, and the DOM does not preserve it. Furthermore, there
doesn't even *have* to be an original document - the DOM tree could
have been created from scratch.
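
To illustrate the from-scratch case, here is a minimal sketch. It assumes
the present-day xml.dom.minidom, where toxml() accepts an encoding
argument; the helper name build_and_serialize is made up for this example.
The tree is never parsed from anything, so the only encoding that can go
into the header is the one the caller passes in.

```python
from xml.dom.minidom import Document

def build_and_serialize(text, encoding="utf-8"):
    """Build a one-element DOM tree from scratch and serialize it.

    Hypothetical helper for illustration; there is no "original"
    document whose encoding could be reused here.
    """
    doc = Document()
    root = doc.createElement("greeting")
    root.appendChild(doc.createTextNode(text))
    doc.appendChild(root)
    # With an encoding given, minidom writes it into the XML declaration
    # and returns the serialized document as bytes.
    return doc.toxml(encoding=encoding)

data = build_and_serialize("Gr\u00fc\u00dfe")
```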

> For XML escaping, the approach suggested by this patch would be to use
> xmlcharrefreplace() (see the test script) as the error handler.
> But that doesn't help with &<>". Personally, I rather dislike having to do
> a separate replace() for these.
> One approach would be to use character maps which have strategic holes
> where & < > and possibly " live..?

Depends on your output encoding. If you want to use us-ascii as an
output encoding, then it would be easy to create a character map codec
that has holes for these characters. 
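
A rough sketch of that idea for us-ascii, using today's codecs module. The
handler name "xmlescape-sketch" and the function names are invented for
this example. The handler returns bytes rather than a string, on the
assumption that a string replacement would itself be pushed through the
character map and trip over the hole at "&".

```python
import codecs

# us-ascii as a character map, with holes where & < > and " live.
ENCODING_MAP = {cp: cp for cp in range(128) if chr(cp) not in '&<>"'}

def xml_escape_handler(exc):
    """Replace every unencodable character with a numeric reference.

    Hypothetical error handler, in the spirit of xmlcharrefreplace.
    """
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Bytes are copied to the output as-is, bypassing the character map.
    return replacement.encode("ascii"), exc.end

codecs.register_error("xmlescape-sketch", xml_escape_handler)

def encode_ascii_xml(text):
    """Encode text to us-ascii, escaping markup and non-ASCII characters."""
    data, _consumed = codecs.charmap_encode(
        text, "xmlescape-sketch", ENCODING_MAP)
    return data
```

Both the markup characters and anything outside us-ascii fall into the
same error path, so no separate replace() pass is needed in this case.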

If the user wants to specify the output encoding, this may be more
difficult, since the codec for the output encoding may not be based on
character maps. Since this is the application that the SF patch has in
mind, I doubt you can avoid the replace calls.
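
For an arbitrary output encoding, the combination would look roughly like
the following sketch: replace calls for the markup characters, then the
xmlcharrefreplace error handler for whatever the chosen codec cannot
represent. The function name escape_and_encode is made up for this
example.

```python
def escape_and_encode(text, encoding="utf-8"):
    """Escape XML markup characters, then encode to the given encoding.

    Hypothetical helper for illustration. "&" must be replaced first so
    that the ampersands introduced by the later replacements are not
    escaped a second time.
    """
    escaped = (text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace('"', "&quot;"))
    # Characters the target codec cannot represent become &#NNNN;
    # references; everything else is encoded normally.
    return escaped.encode(encoding, "xmlcharrefreplace")
```

This works with any codec, character-map based or not, because the markup
escaping happens on the string before the codec ever sees it.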