[I18n-sig] XML and UTF-16

M.-A. Lemburg mal@lemburg.com
Fri, 01 Jun 2001 15:57:11 +0200


"Martin v. Loewis" wrote:
> 
> > > """Entities encoded in UTF-16 must begin with the Byte Order Mark
> > > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC
> > > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3]
> > > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
> > > signature, not part of either the markup or the character data of the
> > > XML document. XML processors must be able to use this character to
> > > differentiate between UTF-8 and UTF-16 encoded documents."""
> >
> > Where did you get that from ?
> 
> That's from the XML recommendation, section 4.3.3. I really recommend
> that you get a copy of that document :-)

Just did... :)
 
> > Note that the Unicode specs have a different opinion on this... (a
> > BOM is part of a protocol and should only be used if the
> > encoding information is not available in some other form or
> > implicit)
> 
> Why is that different? XML says that the BOM is not part of the
> document, but an encoding signature. You say that it is part of a
> protocol - in the XML case, it is part of the encoding autodetection
> protocol.
> 
> If the character was part of the document, any document containing it
> would be ill-formed, since the ZWNBSP is not allowed as the first
> character of an XML document (only whitespace and '<' are allowed,
> AFAICT).

In that sense you are right. I was under the impression that the
quoted text was talking about UTF-16 documents in general (not just
XML docs).
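
To make the "encoding signature" reading concrete, here is a rough
sketch of the BOM check described in the quoted section 4.3.3. It is
only an illustration: sniff_bom is a hypothetical helper name, it
ignores any encoding declaration, and it assumes that an entity
without a UTF-16 BOM is UTF-8. It also does not try to tell UTF-16
apart from other encodings whose signature starts with the same bytes.

```python
import codecs


def sniff_bom(data):
    """Guess the encoding of an XML entity from a leading UTF-16 BOM.

    Returns (encoding, payload), where payload is the input with the
    BOM removed. Hypothetical helper: with no UTF-16 BOM present it
    simply assumes UTF-8 and returns the data unchanged.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
    return "utf-8", data


# The BOM is consumed as a signature, so the decoded text starts
# with '<' rather than with U+FEFF.
raw = codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")
encoding, payload = sniff_bom(raw)
text = payload.decode(encoding)
```

Since the signature is stripped before decoding, the parser never
sees the ZWNBSP as document content, which matches Martin's point
above.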

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/