[XML-SIG] XML Unicode and UTF-8
"Martin v. Löwis"
martin at v.loewis.de
Thu Aug 5 15:30:48 CEST 2004
n.youngman at ntlworld.com wrote:
> Sorry, I missed a key point out. Segment is the decoded part of
> the output from email.Header.decode_header(). I believed this was a
> unicode string, but checking back in the documentation it doesn't
> actually say that, so I guess at least part of the problem is I'm
> getting some sort of binary data, which I thought was Unicode, but
Indeed. decode_header gives you a list of (byte, encoding) pairs
precisely because it does not attempt to decode them. In turn, it
does not try to decode them because Python might not have a codec
for some of the encodings. Generally, you would do
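    result = []
    for h, enc in Header.decode_header(header):
        # loop body lost in the archive; assumed: decode with the declared charset
        result.append(unicode(h, enc or "us-ascii"))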
which will raise a LookupError if there is an unsupported encoding.
As you are going to put the header into an XML document, you really
have little choice what to do in that case - if giving up is not
an option, decoding the bytes as ASCII with the "replace" error handler
might be your next best choice: this will assume that any encoding
is an ASCII superset, and replace all non-ASCII bytes with question
marks.
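As a rough sketch of the decode-then-fall-back approach described above, written for present-day Python 3 (an assumption; the post predates it, and there the module is email.header and str is the Unicode type). The helper name header_to_text is hypothetical. Note that when decoding, the "replace" handler substitutes U+FFFD (the Unicode replacement character) rather than a literal question mark.

```python
from email.header import decode_header


def header_to_text(raw_header):
    """Decode an RFC 2047 header value into a single Unicode string.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for part, charset in decode_header(raw_header):
        if isinstance(part, str):
            # No encoded words in the header: Python 3 hands back text as-is.
            pieces.append(part)
            continue
        try:
            # Unencoded parts come back with charset None; assume ASCII there.
            pieces.append(part.decode(charset or "ascii"))
        except (LookupError, UnicodeDecodeError):
            # Unknown codec or undecodable bytes: treat the data as an
            # ASCII superset and mark everything else as unknown.
            pieces.append(part.decode("ascii", "replace"))
    return "".join(pieces)


known = header_to_text("=?iso-8859-1?q?L=F6wis?=")
unknown = header_to_text("=?x-no-such-charset?q?L=F6wis?=")
```

Here known is the properly decoded name, while unknown keeps the ASCII letters and carries one replacement character where the byte 0xF6 was.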
All that decode_header does is decode the transfer encoding (i.e.
Q or B).
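To illustrate that point, here is a small sketch (again assuming Python 3's email.header) that feeds the same ISO-8859-1 bytes through both transfer encodings. The variable names are made up for the example. In both cases the result is still a byte string plus the charset name, not a Unicode string.

```python
import base64
from email.header import decode_header

# The same five ISO-8859-1 bytes, wrapped once as Q and once as B.
raw = "L\u00f6wis".encode("iso-8859-1")
q_form = "=?iso-8859-1?q?L=F6wis?="
b_form = "=?iso-8859-1?b?" + base64.b64encode(raw).decode("ascii") + "?="

q_result = decode_header(q_form)
b_result = decode_header(b_form)

# Only the transfer encoding has been undone; the charset is merely reported.
print(q_result)
print(b_result)
```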
>>> Leaves binary data in the document. I have assumed that this was
>>> raw Unicode, may be that's a flawed assumption?
> XML doesn't, Python does. If I ask it to print without encoding it, I
> don't know whether it's passed through unchanged. Raw Unicode seems
> to me like a reasonable term for the data in a unicode string.
Ah, that. Don't worry about the internal representation of a Unicode
string. It may use 2 or 4 bytes per character, and be big or little endian. You
are never going to see that directly, as there is *always* an encoding
going on to convert the Unicode object into a byte string. Of course,
you could create a buffer object to really find out, but that should
not be done.
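A short sketch of why the internal layout does not matter, assuming Python 3 (where str is the Unicode type): the same string yields different byte strings depending on the encoding you ask for, and each of them decodes back to the same text.

```python
text = "L\u00f6wis"  # one Unicode string, five characters

# Each encoding produces a different byte string for the same text.
encoded = {
    "utf-8": text.encode("utf-8"),
    "utf-16-le": text.encode("utf-16-le"),
    "utf-16-be": text.encode("utf-16-be"),
    "iso-8859-1": text.encode("iso-8859-1"),
}

for name, data in encoded.items():
    # Decoding with the matching codec always recovers the original string.
    assert data.decode(name) == text
    print(name, len(data), data)
```

None of these byte strings is "the" raw form of the string; they are just different external representations of it.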
> You have neatly pinpointed where I was confused. Your assistance is
> much appreciated.
You are welcome!