[I18n-sig] UTF-8 and BOM

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Mon, 21 May 2001 16:50:41 +0200


> That's hard to implement... how would the codec know where the
> stream starts -- it only interfaces to the underlying stream
> using .read() and .write() ?

The stream readers and writers should assume that only the first read
or write operation uses U+FEFF (ZWNBSP) as the BOM, so they should stop
giving it a byte-order meaning once they have seen the first chunk of
data. After that chunk, any U+FEFF is an ordinary ZWNBSP character.
For the writer, that is best implemented by replacing the .encode
function with utf_16_be_encode or utf_16_le_encode (as appropriate)
after the first write.
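
A minimal sketch of that idea for the writer side, not the actual
codec source. The class name BomOnceWriter is made up for illustration;
codecs.utf_16_encode, codecs.utf_16_le_encode and
codecs.utf_16_be_encode are the low-level encoder functions exposed by
the codecs module, each returning a (bytes, length) tuple.

```python
import codecs
import io
import sys


class BomOnceWriter:
    """Hypothetical writer: emits a BOM with the first chunk only."""

    def __init__(self, stream):
        self.stream = stream
        # The generic UTF-16 encoder prepends a BOM in native byte order.
        self.encode = codecs.utf_16_encode

    def write(self, text):
        data, _ = self.encode(text)
        self.stream.write(data)
        if self.encode is codecs.utf_16_encode:
            # First chunk has been written: switch to the fixed-endian
            # encoder so that later U+FEFF characters are plain ZWNBSP
            # and no second BOM is ever produced.
            if sys.byteorder == "little":
                self.encode = codecs.utf_16_le_encode
            else:
                self.encode = codecs.utf_16_be_encode


buf = io.BytesIO()
writer = BomOnceWriter(buf)
writer.write("a")
writer.write("\ufeffb")  # U+FEFF here is data, not a byte-order mark
raw = buf.getvalue()
```

Decoding raw with the generic "utf-16" codec strips only the leading
BOM, so the U+FEFF in the second chunk survives as a character.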

> Note that this only happens in the UTF-16 codec. All other codecs
> pass through the BOMs as-is. Perhaps I should modify the UTF-16
> codec to only remove BOMs when used in UTF-16 mode (without byte
> order indication) and not in UTF-16-LE/UTF-16-BE mode ?!

You may want to study the RFC just to be sure, but I think this is how
UTF-16-[BL]E are defined.
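
For illustration, this is the distinction as it can be observed in
current CPython (which may differ from the codec as it stood when this
was written): the generic "utf-16" codec consumes a leading BOM, while
"utf-16-le" passes it through as U+FEFF.

```python
import codecs

# Little-endian BOM followed by "hi" in UTF-16-LE.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Generic codec: the BOM selects the byte order and is removed.
generic = data.decode("utf-16")

# Fixed-endian codec: byte order is already known, so the leading
# U+FEFF is kept as an ordinary character.
fixed = data.decode("utf-16-le")

print(repr(generic))  # 'hi'
print(repr(fixed))    # '\ufeffhi'
```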

Regards,
Martin