[I18n-sig] XML and UTF-16

Tom Emerson tree@basistech.com
Thu, 31 May 2001 13:23:31 -0400


M.-A. Lemburg writes:
> What is the standard file layout to use for storing an XML file
> in UTF-16 ?

I believe this is covered in the XML specification, in the non-normative
appendix on autodetection of character encodings (Appendix F).

> 1) encode the whole file in UTF-16 (possibly prepended with a BOM)

Yes. You can then pretty easily autodetect which Unicode
transformation format is being used by looking at the first ten or
so bytes.

If the BOM is present, that's a big clue right there.

UTF-16BE will have the first "<?xml" encoded like

003C 003F 0078 006D 006C

while UTF-16LE will have it encoded as

3C00 3F00 7800 6D00 6C00

ASCII and UTF-8 will just have

3C 3F 78 6D 6C
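
For what it's worth, here is a rough Python sketch of that check. The
function name sniff_xml_encoding and the returned labels (Python codec
names) are my own choices for illustration, not anything from the XML
spec. It assumes the document starts with "<?xml" and it ignores UTF-32
and EBCDIC entirely, so treat it as a starting point rather than a
complete detector.

```python
import codecs


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes.

    Hypothetical helper: handles only UTF-8/ASCII and UTF-16.
    """
    # A byte order mark, if present, is the strongest hint.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if head.startswith(codecs.BOM_UTF16_LE):
        # Assumption: UTF-32LE (whose BOM also starts FF FE) is not expected.
        return "utf-16-le"

    # No BOM: look at how the leading "<?" is laid out.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if head.startswith(b"<?xml"):
        # ASCII-compatible; the encoding declaration says which one exactly.
        return "utf-8"
    return "unknown"
```

You would read the first handful of bytes of the file in binary mode,
pass them in, and then decode the rest with whatever comes back (after
checking the encoding declaration in the ASCII-compatible case).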

> 2) write the first line containing the XML header (which has the
>    encoding information) in ASCII and then proceed with UTF-16
>    starting after the newline character

Ugh, no.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"