[I18n-sig] XML and UTF-16

Paul Prescod paulp@ActiveState.com
Thu, 31 May 2001 14:17:18 -0700


Tom Emerson wrote:
> 
>...
> 
> Yes. You can then pretty easily autodetect which Unicode
> transformation format is being used by looking at the first ten or
> so bytes.

Actually, the first four bytes are sufficient to get you started. Then
you have to look at the encoding declaration if present.
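
A rough Python sketch of that two-step approach, loosely following the
autodetection appendix of the XML 1.0 spec. The function names
sniff_family and detect_xml_encoding are my own inventions for
illustration, EBCDIC is left out, and the declaration is only consulted
when the first four bytes point to an ASCII-compatible encoding:

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_family(data: bytes) -> str:
    """Guess the encoding family from the first four bytes only."""
    head = data[:4]
    # Check the 4-byte UTF-32 BOMs first: the UTF-32-LE BOM starts
    # with the same two bytes as the UTF-16-LE BOM.
    if head in (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        return "utf-16"
    # No BOM: look at how "<" (and "?") are laid out in memory.
    if head == b"\x00\x00\x00<":
        return "utf-32-be"
    if head == b"<\x00\x00\x00":
        return "utf-32-le"
    if head == b"\x00<\x00?":
        return "utf-16-be"
    if head == b"<\x00?\x00":
        return "utf-16-le"
    # Anything else is treated as ASCII-compatible; UTF-8 is the
    # default unless an encoding declaration says otherwise.
    return "utf-8"

def detect_xml_encoding(data: bytes) -> str:
    """Return a codec name: first four bytes, then the declaration."""
    family = sniff_family(data)
    if family != "utf-8":
        return family
    # The declaration itself is pure ASCII, so a lossy decode is enough.
    m = _DECL.match(data[:256].decode("ascii", errors="replace"))
    return m.group(1).lower() if m else "utf-8"
```

For example, detect_xml_encoding(open("doc.xml", "rb").read()) gives a
name you can hand to bytes.decode(), where doc.xml is just a placeholder
file name.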

> If the BOM is present, that's a big clue right there.

"""Entities encoded in UTF-16 must begin with the Byte Order Mark
described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC
10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
(the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
signature, not part of either the markup or the character data of the
XML document. XML processors must be able to use this character to
differentiate between UTF-8 and UTF-16 encoded documents."""
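
To make the "encoding signature, not part of the character data" point
concrete, here is a small Python sketch. The helper name
decode_utf16_entity is hypothetical, and rejecting input without a BOM
is my reading of the "must begin with" wording, not something every
parser necessarily does:

```python
import codecs

def decode_utf16_entity(data: bytes) -> str:
    """Decode a UTF-16 entity, insisting on the leading BOM."""
    if data[:2] not in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        raise ValueError("UTF-16 entity lacks the required Byte Order Mark")
    # The generic "utf-16" codec reads the BOM to pick the byte order
    # and drops it, so U+FEFF never shows up in the decoded text.
    return data.decode("utf-16")

raw = codecs.BOM_UTF16_BE + "<a/>".encode("utf-16-be")
text = decode_utf16_entity(raw)

# An endian-specific codec, by contrast, keeps U+FEFF as a character.
leaky = raw.decode("utf-16-be")
```

Here text holds only the markup, while leaky still starts with U+FEFF,
which is why the endian-specific codecs are the wrong tool when a
signature is present.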

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook