[I18n-sig] XML and UTF-16

Tom Emerson tree@basistech.com
Thu, 31 May 2001 13:52:11 -0400

M.-A. Lemburg writes:
> Perhaps we should have some smart auto-detection API somewhere
> which does this automagically ?! Something like
> 	guess_xml_encoding(data) -> encoding string
> It could work by looking at the first 256 bytes of the data
> string and then apply all the tricks needed to extract the
> encoding information (or default to UTF-8 if no such information
> is given).

Yes, I think this would be a good idea. I would use something along
the lines of:

0) Assume UTF-8.

1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the
   appropriate transmission format and byte order. Check UTF-32 first,
   since the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM
   (FF FE). Goto 4.

2) Look for the UTF-8 uniBOM, since some editors like putting that in.
   Skip over it, stay with UTF-8, and goto 4.

3) Look for the sundry forms of '<?xml ' in ASCII, UTF-16, and UTF-32,
   with appropriate endian variants. If found, assume the detected
   encoding. Goto 4.

4) Look for the encoding attribute of the XML declaration and validate
   it against the detected encoding. If we detected that the file is
   in UTF-16BE but the encoding attribute claims UTF-8, something is
   wrong somewhere.

5) If the value of the encoding attribute is consistent with the
   detected encoding, continue and possibly instantiate the
   appropriate transcoders for the document (e.g., if you see
   something like "<?xml version='1.0' encoding='GB2312'?>").
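
To make the steps above concrete, here is a rough Python sketch. It is
only an illustration under a few assumptions, not a finished API: the
name guess_xml_encoding() comes from the quoted proposal, while the
helpers _BOMS, _PREFIXES, _DECL and _family() are made up for this
example, and raising ValueError on a mismatch is just one possible way
to report the "something is wrong somewhere" case in step 4.

```python
import codecs
import re

# BOMs, UTF-32 first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF8, "utf-8"),
]

# '<?xml ' with no BOM, in the UTF-16 and UTF-32 endian variants (step 3).
# The plain ASCII form needs no entry: it falls through to the default.
_PREFIXES = [
    ("<?xml ".encode(enc), enc)
    for enc in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le")
]

# Encoding attribute inside the XML declaration (hypothetical helper).
_DECL = re.compile(
    r"""<\?xml\s[^>]*?encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def _family(name):
    """Reduce 'UTF-16BE', 'utf-16-le', ... to 'utf-16'; pass others through."""
    name = name.lower().replace("_", "-")
    for fam in ("utf-32", "utf-16", "utf-8"):
        if name.startswith(fam):
            return fam
    return name


def guess_xml_encoding(data):
    """Guess the encoding of an XML document from its first 256 bytes."""
    head = data[:256]
    detected, certain = "utf-8", False           # step 0
    for bom, enc in _BOMS:                       # steps 1 and 2
        if head.startswith(bom):
            detected, certain = enc, True
            head = head[len(bom):]               # skip over the BOM
            break
    else:
        for prefix, enc in _PREFIXES:            # step 3
            if head.startswith(prefix):
                detected, certain = enc, True
                break

    # Step 4: read the declaration using what we have detected so far.
    match = _DECL.match(head.decode(detected, errors="replace"))
    if match is None:
        return detected
    declared = match.group(2)
    if not certain:
        # Only the default was in play, so trust the declaration (step 5).
        return declared
    if _family(declared) != _family(detected):
        raise ValueError(
            "declared encoding %r conflicts with detected %r"
            % (declared, detected)
        )
    return detected
```

With this sketch, a document starting with "<?xml version='1.0'
encoding='GB2312'?>" in plain ASCII yields the declared name, and the
caller can then pick the transcoder. A UTF-16BE document that claims
UTF-8 raises ValueError.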

Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"