writing Unicode objects to XML

Martin v. Löwis martin at v.loewis.de
Mon May 5 17:40:55 EDT 2003


Steven Taschuk <staschuk at telusplanet.net> writes:

> > There is no way, in XML, to specify which characters will be encoded in the
> > native encoding (e.g. '\xc3\xa8' in utf-8 in this case) and which ones will
> > be encoded using character references instead.
> 
> A nit: whether this is true is a property of one's XML tools, not
> a property of XML itself.  It is easy to imagine XML writers with
> all sorts of policies about character encoding.  (See below.)

Well, no. There is a notion of the "XML Information Set", see

http://www.w3.org/TR/xml-infoset/

Section 2.6 of that document introduces the notion of a "Character
Information Item":

# There is a character information item for each data character that
# appears in the document, whether literally, as a character
# reference, or within a CDATA section.

A character information item does *not* indicate whether the
character was encoded in the document's source encoding or written as
a character reference. "Not being part of the XML infoset" is really
the same thing as "no way in XML".
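
To illustrate the point, here is a minimal sketch using the standard
library's xml.etree.ElementTree (my choice for the example, not
something discussed in this thread; it assumes Python 3). The three
sample documents are made up. They spell U+00E8 literally in UTF-8, as
a character reference, and inside a CDATA section, and the parser
reports the same character data for all three:

```python
import xml.etree.ElementTree as ET

# Three hypothetical documents that differ only in how U+00E8 is spelled.
literal = "<doc>\u00e8</doc>".encode("utf-8")              # raw UTF-8 bytes
reference = b"<doc>&#xE8;</doc>"                           # character reference
cdata = "<doc><![CDATA[\u00e8]]></doc>".encode("utf-8")    # CDATA section

# After parsing, only the character itself is left; the spelling is gone.
texts = [ET.fromstring(src).text for src in (literal, reference, cdata)]
print(texts)

# On output, the serializer picks a spelling based on the target encoding,
# not on anything stored in the parsed tree.
elem = ET.fromstring(reference)
as_ascii = ET.tostring(elem, encoding="us-ascii")
as_utf8 = ET.tostring(elem, encoding="utf-8")
print(as_ascii)
print(as_utf8)
```

Here the choice between a literal character and a reference on output
is made by the writer from the requested encoding alone, since the tree
carries no record of how the character was originally written.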

Regards,
Martin



