[I18n-sig] Mixed encodings and XML

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 14 Dec 2000 04:05:01 +0100

> How would one go about creating a well-formed XML document with multiple
> encodings?

As others have pointed out: You don't. XML documents are in
Unicode. They may have some other encoding *for transfer*, but
conceptually, they are still in Unicode.
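
To illustrate the point, here is a minimal sketch, assuming a modern
Python 3 with the standard xml.etree.ElementTree module (neither is
mentioned in the original thread). The helper name serialize and the
sample text are made up for the example. It writes the same document in
two transfer encodings and checks that a parser hands back the same
Unicode text either way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text: Chinese and Japanese words in one Unicode string.
text = "\u4e2d\u6587 and \u65e5\u672c\u8a9e"

def serialize(encoding):
    """Return the same document as bytes in the given transfer encoding."""
    doc = '<?xml version="1.0" encoding="%s"?><doc>%s</doc>' % (encoding, text)
    return doc.encode(encoding)

utf8_bytes = serialize("utf-8")
utf16_bytes = serialize("utf-16")

# The byte streams differ, but the parsed character data is the same.
parsed_utf8 = ET.fromstring(utf8_bytes).text
parsed_utf16 = ET.fromstring(utf16_bytes).text
print(parsed_utf8 == parsed_utf16 == text)
```

Only the bytes on the wire change; the document the application sees is
one sequence of Unicode characters.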

> It contains many sections within HTML PREs with the different encodings
> I mentioned.  They look like
> <PRE LANG="zh-TW">
> ... BIG5-encoded stuff ...
> </PRE>

So what you really want is to include binary data in a tag. As you
explained yourself when replying to Marc-Andre, that is not supported
in XML. Of course, if XML had a BDATA type (or section) you could
include a binary data fragment, and then any presentation tool would
have to provide visualization (such as opening a hex editor on

In the specific case of cjkv.doc, I guess the best approach would be:
- use Python string escapes in Python code, e.g.
  sjisStr = "\x88\xc0\x91\x53\x82\xc9\x8e\x67\x82\xa6\x82\xe9"
  # Shift-JIS encoded source string
- use Unicode text data where output is intended to be displayed properly
- don't cite the output if it will come out as gibberish on any terminal
  (e.g. when printing both SJIS and UTF-8 on the same terminal). Instead,
  explain what the user will likely see.
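
The first two suggestions could be combined roughly as follows. This is
only a sketch, assuming a modern Python 3, where the escaped bytes go in
a bytes literal and the "shift_jis" codec ships with the standard
library. The variable names are illustrative, not taken from cjkv.doc.

```python
# Shift-JIS encoded source data, written with escapes so that the source
# file itself stays pure ASCII.
sjis_bytes = b"\x88\xc0\x91\x53\x82\xc9\x8e\x67\x82\xa6\x82\xe9"

# Decode once into Unicode text; this is the form to use for display.
text = sjis_bytes.decode("shift_jis")

# Re-encode in whatever single encoding the output channel expects.
utf8_bytes = text.encode("utf-8")

# ascii() shows the text as \u escapes, which is safe on any terminal.
print(ascii(text))
print(len(sjis_bytes), "SJIS bytes,", len(utf8_bytes), "UTF-8 bytes")
```

Printing the escaped form avoids the gibberish problem described above,
since the terminal never has to interpret two encodings at once.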