writing Unicode objects to XML

Alex Martelli aleax at aleax.it
Mon May 5 07:04:38 EDT 2003


Alessio Pace wrote:
   ...
> Maybe I am missing something, because I tried but in the resulting new XML
> file I don't see what I expect.. Starting again, I have an XML file
> declared encoded in UTF-8 (anyway, is it the default if I don't specify

Yes.

> anything?) and which contains character references such as
> &#xe8; and some others in the Text nodes. I parse it with

There is no way, in XML, to specify which characters will be encoded in the
native encoding (e.g. '\xc3\xa8' in utf-8 in this case) and which ones will
be encoded using character references instead.

E.g., your original XML file could be:

<?xml version="1.0" encoding="utf-8"?>
<foo>
\xc3\xa8\xc3\xa8\xc3\xa8&#xe8;&#xe8;&#xe8;
</foo>

OR, *totally indifferently*:

<?xml version="1.0" encoding="utf-8"?>
<foo>
&#xe8;&#xe8;&#xe8;\xc3\xa8\xc3\xa8\xc3\xa8
</foo>

These represent EXACTLY the same XML document (the textnode is made
up of six accented e characters in each case).  There is no way, in
XML itself, to distinguish which of the two to the sixth power (64)
exactly equivalent ways of representing these six accented e's may
have been used in the file -- each one can independently choose a
character reference, or native representation.
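
To make the counting concrete, here is a minimal sketch (written for
Python 3, which postdates this thread) that builds all 64 byte-level
variants and checks that minidom sees the same text in every one. The
names NATIVE, REFERENCE, HEADER and text_of are made up for the
illustration, and the newlines inside <foo> are left out for brevity.

```python
from itertools import product
from xml.dom import minidom

NATIVE = "\u00e8".encode("utf-8")   # the two bytes '\xc3\xa8'
REFERENCE = b"&#xe8;"               # the same character as a reference
HEADER = b'<?xml version="1.0" encoding="utf-8"?>\n'


def text_of(data):
    """Parse the bytes and return the text content of the root element."""
    dom = minidom.parseString(data)
    return "".join(node.data for node in dom.documentElement.childNodes)


# Each of the six characters independently picks one representation.
variants = [
    HEADER + b"<foo>" + b"".join(choice) + b"</foo>"
    for choice in product((NATIVE, REFERENCE), repeat=6)
]

# Collect the distinct parsed texts across all byte-level variants.
texts = {text_of(variant) for variant in variants}
```

All 64 byte strings differ, yet the set of parsed texts has a single
member: six accented e characters.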


> xml.dom.minidom.parse(pathToFile) and get a reference to a DOM tree, let's
> call this variable 'xmldoc'. Now, let's say I want to store again this DOM
> tree (because my application will have to modify some parameters in it). I

So far so good, you can emit this XML document into an XML file as follows.

> thought I had to do just:
> f = codecs.open('file.xml', 'w', 'utf8')
> f.write(xmldoc.toxml(encoding='utf-8') )
> f.close()
> But the result is not the original xml....

No, here you have some redundancy too -- given the encoding= argument,
method toxml emits an already-encoded string, and then you're passing it
to a codecs.open object that tries to decode and re-encode it all over
again, with much wasted effort -- the output of method toxml is a plain,
already encoded string, suitable for passing to a *file*'s write method.
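
As a rough sketch of the simpler route, assuming Python 3 (where the
already-encoded result of toxml is a bytes object) and a throwaway
temporary directory standing in for your real path:

```python
import os
import tempfile
from xml.dom import minidom

# A tiny stand-in for the document parsed in the original question.
xmldoc = minidom.parseString(
    b'<?xml version="1.0" encoding="utf-8"?><foo>&#xe8;</foo>'
)

# With encoding= given, toxml returns already-encoded bytes.
encoded = xmldoc.toxml(encoding="utf-8")

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "file.xml")

# Open the file in binary mode and write the bytes as they are,
# with no codecs wrapper in between.
with open(path, "wb") as f:
    f.write(encoded)

# Reading the file back and parsing it yields the same text.
with open(path, "rb") as f:
    reparsed = minidom.parse(f)
```

The bytes on disk need not match the input file (here the reference
comes back as a native UTF-8 sequence), but the parsed document is
the same.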

But, that is not the issue.  You *ARE* getting "the original XML", but
you seem to labor under the false assumption that "the original XML"
somehow implies (or at least implies given an encoding) "the same string
of bytes".  It doesn't, of course.  There are MANY streams of bytes,
even given an encoding, that could represent exactly the same XML.  Besides
the issue of character references, think for example how ANY piece of text
MIGHT indifferently be represented as CDATA... or MIGHT NOT, in a way that
XML *defines* to be totally identical, indifferent, interchangeable.
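
A small sketch of the CDATA point, again assuming Python 3. It uses
xml.etree.ElementTree rather than minidom, purely because ElementTree
hands back the character content without recording how it was
spelled; the sample strings are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Plain text, with the '<' escaped and the accented e as a reference.
plain = ET.fromstring(b"<foo>a &lt; b &#xe8;</foo>")

# The same characters wrapped entirely in a CDATA section.
cdata = ET.fromstring("<foo><![CDATA[a < b \u00e8]]></foo>".encode("utf-8"))

# A mixture: only the '<' lives inside CDATA.
mixed = ET.fromstring(b"<foo>a <![CDATA[<]]> b \xc3\xa8</foo>")

# The character content of each root element.
contents = [plain.text, cdata.text, mixed.text]
```

Three different byte streams, one and the same character content.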

So, if you're labouring under further, non-XML constraints, you have to
employ non-XML means to try and meet those constraints.  It would be very
nice to find out exactly what those constraints are, in minute detail,
since surely you cannot rely on the XML standardization documents (nor on
any documents that in turn rely on those, such as all of the various
application-area specific DTD's and the like) to tell you.

For example, do you have to reproduce the exact representation choice
made for each single text character in the input stream (CDATA versus
plain text versus character reference)?  That's gonna be a TALL order
indeed -- and what will you do if the changes you make insert any new
text character whatsoever?

Maybe you can get away with something much simpler, such as, e.g., "even
though the encoding chosen would be perfectly able to represent directly
all Unicode characters, nevertheless, in order to satisfy a PHB who gives
what he THINKS are XML-related specs but has never read one line of the
XML standards, still we have to represent all characters outside of the
ASCII range as character references" (or, "all characters whose Unicode
code is even" -- just about as meaningful).

If you can find semi-sensible specs of this kind, then you can
post-process the string produced by the toxml method in order to
satisfy them.  But if your specs are not even semi-sensible, it WILL
be a lot of work.  The XML tools themselves don't help much because,
of course, they DO deal with XML issues, and not with the NON-XML
ones you seem (apparently without realizing) to be struggling with.
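
For the "everything outside ASCII as a character reference" kind of
spec, a post-processing step might look like the sketch below. It
assumes Python 3, where toxml() with no encoding returns text, and it
leans on the standard "xmlcharrefreplace" error handler, which emits
decimal references such as &#232;. It also assumes that non-ASCII
characters occur only in text and attribute values: a reference is not
legal inside an element name, a comment or a processing instruction,
so a document with such characters there would need more care.

```python
from xml.dom import minidom

# Input mixing a native accented e with a character reference.
source = (
    '<?xml version="1.0" encoding="utf-8"?><foo>\u00e8&#xe8;</foo>'
).encode("utf-8")
xmldoc = minidom.parseString(source)

# Serialize to text first; no encoding is applied at this point.
serialized = xmldoc.toxml()

# Encode to ASCII, turning every unencodable character into a
# numeric character reference instead of raising an error.
ascii_only = serialized.encode("ascii", "xmlcharrefreplace")

# The result is still the same document as far as XML is concerned.
roundtrip = minidom.parseString(ascii_only)
```

Both accented characters come out as references, whichever way they
were written in the input, which is exactly why the original byte
layout cannot be recovered this way.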

> My sys.defaultencoding  is iso-8859-1, specified in the sitecustomize.py
> script in python site-packages directory.

This is not affecting your specific problem in any way at all.

> Thank you in advance.

You're welcome, but I suspect the answer's not what you wanted to hear.


Alex




