[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Mon, 21 May 2001 19:02:35 +0200

"Martin v. Loewis" wrote:
> > That's hard to implement... how would the codec know where the
> > stream starts -- it only interfaces to the underlying stream
> > using .read() and .write() ?
> The stream readers and writers should assume that the first read and
> write operations use the ZWNBSP as the BOM, so they should stop giving
> a byte-order meaning to the BOM once they have seen the first chunk of
> data. That is best implemented by replacing the .encode function with
> utf_16_be/le_encode (as appropriate).

Patches are welcome :-)
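
A minimal sketch of the approach suggested in the quoted text above,
assuming a current Python 3 interpreter rather than the codec code under
discussion here. The class name BOMOnceWriter is hypothetical; it emits a
BOM with the first chunk only and then swaps in the fixed-byte-order
encoder:

```python
import codecs
import io
import sys


class BOMOnceWriter:
    """Hypothetical writer: BOM on the first write only."""

    def __init__(self, stream):
        self.stream = stream
        # The first call uses the encoder that prepends a BOM.
        self.encode = codecs.utf_16_encode

    def write(self, text):
        data, _ = self.encode(text)
        self.stream.write(data)
        # After the first chunk, switch to the encoder for the native
        # byte order so that later chunks do not get another BOM.
        if sys.byteorder == "little":
            self.encode = codecs.utf_16_le_encode
        else:
            self.encode = codecs.utf_16_be_encode


buf = io.BytesIO()
writer = BOMOnceWriter(buf)
writer.write("ab")
writer.write("cd")
result = buf.getvalue()
```

Both chunks end up behind a single BOM, so decoding the whole buffer as
UTF-16 gives back "abcd".
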

> > Note that this only happens in the UTF-16 codec. All other codecs
> > pass through the BOMs as-is. Perhaps I should modify the UTF-16
> > codec to only remove BOMs when used in UTF-16 mode (without byte
> > order indication) and not in UTF-16-LE/UTF-16-BE mode ?!
> You may want to study the RFC just to be sure, but I think this is how
> UTF-16-[BL]E are defined.

According to the Unicode FAQ, BOMs should only be used
where the byte order is not immediately clear. In the case of -LE and
-BE, this information is available, which is why the codecs
don't prepend a BOM.
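
To illustrate the encoder side, here is a short check, assuming a current
Python 3 interpreter (its behaviour is not necessarily that of the codec
version discussed in this thread):

```python
import codecs

text = "hi"

generic = text.encode("utf-16")     # byte order not stated by the name
little = text.encode("utf-16-le")   # byte order stated by the name
big = text.encode("utf-16-be")

# The generic encoder marks the byte order it picked with a BOM.
has_bom = generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(has_bom)
print(little)    # b'h\x00i\x00'
print(big)       # b'\x00h\x00i'
```
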

Ok, I will modify the UTF-16-LE and -BE decoders to not remove
BOMs anymore and fix the UTF-16 decoder to only remove BOMs at
the start of the string. With these changes you should be able
to fix the UTF-16 stream codec to be more RFC compliant.
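
The decoder behaviour described above can be sketched as follows. This
assumes a current Python 3 interpreter, so it illustrates the intended
result and is not the patch itself:

```python
import codecs

# U+FEFF appears once in the middle of the text.
payload = "a\ufeffb".encode("utf-16-le")
data = codecs.BOM_UTF16_LE + payload

# Generic UTF-16: the leading BOM is consumed, the inner U+FEFF is kept.
as_utf16 = data.decode("utf-16")

# UTF-16-LE: the byte order is already known, so nothing is removed.
as_utf16_le = data.decode("utf-16-le")

print(ascii(as_utf16))      # 'a\ufeffb'
print(ascii(as_utf16_le))   # '\ufeffa\ufeffb'
```
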

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/