[XML-SIG] XML Unicode and UTF-8
n.youngman at ntlworld.com
Sat Aug 7 21:36:58 CEST 2004
On Saturday 07 Aug 2004 6:59 pm, Mike Brown wrote:
> Neil Youngman wrote:
> > On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> > > The resulting Unicode object may contain characters which are not
> > > allowed in XML, and thus the text may not be serializable (at least not
> > > in a way that would produce well-formed XML).
> > Yes, but it's being written out through a UTF-8 codec
> Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML
> places restrictions on what characters can be in the *decoded* (Unicode)
> version of the document. The encoded version of the document is just an
> alternative representation of the Unicode one.
> In Python's notation, each character in the document must be one of:
> \t (tab)
> \n (linefeed)
> \r (carriage return)
> \u0020 through \ud7ff
> \ue000 through \ufffd
> \U00010000 through \U0010ffff
> You are not allowed to have any other characters in your document, not even
> by reference (e.g., you can't write &#0; to represent \u0000).
> So let's say you have 256 bytes of binary data, just byte values 0-255:
> >>> bytestring = ''.join(map(chr,range(256)))
OK. I think we're starting from different assumptions here. The data comes
from decoding an RFC1522 header. It is therefore assumed to be text, albeit
in a non-ASCII character set. It should not be an arbitrary chunk of binary
data.
I'm assuming, possibly incorrectly, that the standards are set up in such a
way that if it's valid text, it should be possible to insert the UTF-8
equivalent in XML.
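One way to test that assumption is to check the decoded text against the
character ranges XML 1.0 allows. This is only a rough sketch, assuming a
current Python 3 (where every string is Unicode); the function name
find_illegal_xml_chars is made up for illustration and is not part of any
library.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def find_illegal_xml_chars(text):
    """Return a list of the characters in text that XML 1.0 forbids."""
    return _ILLEGAL_XML_CHARS.findall(text)


# Ordinary text, including CJK, falls inside the allowed ranges.
assert find_illegal_xml_chars("caf\u00e9 \u65e5\u672c\u8a9e") == []
# Control characters such as NUL are what get rejected.
assert find_illegal_xml_chars("a\x00b") == ["\x00"]
```

If that list comes back empty, the text can go into the document whatever
encoding is used to write it out.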
While I theoretically could get something that's not valid text, encoded in an
RFC1522 header, it's only going to cause me real concern if it's a security
flaw. If we can't adequately process invalid data, that's not a major concern
for me. If you are saying that there may be text in character sets supported
in Python (with CJK codecs), that I can't insert as plain UTF-8 into a UTF-8
XML document, that would be a concern.
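For reference, the path I have in mind looks roughly like the following. It
is a sketch under assumptions rather than my actual code: it assumes a
current Python 3 with the standard email.header and xml.etree.ElementTree
modules, and the helper names decode_mime_header and header_to_xml, as well
as the <subject> element, are invented for the example.

```python
import xml.etree.ElementTree as ET
from email.header import decode_header


def decode_mime_header(raw):
    """Decode any encoded-words in a header value into one str."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            # Unencoded fragments come back as bytes with charset None.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)


def header_to_xml(raw):
    """Wrap the decoded header text in a <subject> element, as UTF-8 bytes."""
    elem = ET.Element("subject")
    elem.text = decode_mime_header(raw)
    return ET.tostring(elem, encoding="utf-8")


xml_bytes = header_to_xml("=?iso-8859-1?q?caf=E9?=")
# Parsing the result back shows the output is well-formed XML.
assert ET.fromstring(xml_bytes).text == "caf\u00e9"
```

The source character set only matters at the decode step. Once the header is
a Unicode string, the question is purely whether its characters are ones XML
permits.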