[I18n-sig] XML and UTF-16

M.-A. Lemburg mal@lemburg.com
Fri, 01 Jun 2001 10:17:04 +0200

Paul Prescod wrote:
> Tom Emerson wrote:
> >
> >...
> >
> > Yes. You can then pretty easily autodetect which Unicode
> > transformation format is being used by looking at the first ten or
> > so bytes.
> Actually, the first four bytes are sufficient to get you started. Then
> you have to look at the encoding declaration if present.
> > If the BOM is present, that's a big clue right there.
> """Entities encoded in UTF-16 must begin with the Byte Order Mark
> described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC
> 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
> signature, not part of either the markup or the character data of the
> XML document. XML processors must be able to use this character to
> differentiate between UTF-8 and UTF-16 encoded documents."""

Where did you get that from? Note that the Unicode specs take a
different view on this... (a BOM is part of a protocol and should
only be used if the encoding information is neither available in
some other form nor implicit)
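
For reference, the four-byte check Paul mentions could be sketched
roughly as follows. This is only an illustration: the function name
sniff_xml_encoding is made up here, and falling back to UTF-8 when
nothing matches is an assumption. A real parser would still have to
read the encoding declaration afterwards, if one is present.

```python
import codecs

def sniff_xml_encoding(data):
    """Guess (encoding, bom_length) from the first four bytes of an entity.

    Hypothetical helper for illustration only; it does not parse the
    encoding declaration.
    """
    head = data[:4]
    # UTF-32 BOMs are tested first because the UTF-32-LE BOM
    # (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if head.startswith(bom):
            return name, len(bom)
    # No BOM: see how the leading "<?" of an XML declaration is laid out.
    if head == b"\x00<\x00?":
        return "utf-16-be", 0
    if head == b"<\x00?\x00":
        return "utf-16-le", 0
    # Assumption: treat everything else as UTF-8 (or ASCII-compatible).
    return "utf-8", 0
```

The second item of the result is the number of BOM bytes to skip
before handing the data to a decoder.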

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/