[I18n-sig] UTF-8 and BOM

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 17 May 2001 06:22:42 +0200


> Why should a BOM behave any different than any other Unicode
> character ? BOMs can be added and deleted in pretty much all
> places of a Unicode text -- that's their intent after all, so
> I don't see how they could break any property of an encoding.
> 
> Or did you have the same misunderstanding as I did ? ... 
> Paul is talking about the UTF-8 encoding of the BOM mark ('\xef\xbb\xbf'),
> not the FF FE or FE FF byte sequence as is seen in UTF-16 streams.

So am I. I think the UTF-8 decoder should remove the first Unicode
character of the decoded text when that character is the BOM.
It should be removed in that place because it was inserted only to
identify UTF-8 (just as the byte sequence FF FE was inserted into the
UTF-16 stream to identify it as UTF-16, and to identify the byte
order).
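
As a rough illustration of the behaviour proposed above, here is a
minimal Python sketch. The function name decode_utf8_strip_leading_bom
is hypothetical; it is not an existing codec, only a wrapper around the
ordinary "utf-8" decoder that drops one BOM if, and only if, it is the
very first thing in the input.

```python
import codecs


def decode_utf8_strip_leading_bom(data: bytes) -> str:
    """Decode UTF-8, dropping a single BOM only when it starts the input.

    Hypothetical helper for illustration; not part of any library.
    """
    # codecs.BOM_UTF8 is the byte string b'\xef\xbb\xbf'.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # A BOM anywhere else is ordinary content and is left untouched.
    return data.decode("utf-8")


print(repr(decode_utf8_strip_leading_bom(b"\xef\xbb\xbfabc")))
print(repr(decode_utf8_strip_leading_bom(b"a\xef\xbb\xbfbc")))
```

Only the signature at the start is dropped; a U+FEFF later in the
stream, or a second one directly after the first, stays in the result.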

I don't think the decoder should remove the BOM from any other
location in the text, since removing it *does* change the content of
the text. It may be removed as part of applying some normalization,
but that should not happen unless the application explicitly requests
that normalization. In fact, none of the Unicode normalization forms
removes the BOM (see TR #15). The recommendation is to allow the BOM
as a valid character in identifiers, and to remove it before
comparing identifiers (since it is a formatting character).
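
The two points above can be checked with a short Python sketch. It
assumes the standard unicodedata module; the helper names
bom_survives_normalization and identifiers_equal are hypothetical and
exist only for this illustration. The comparison helper strips U+FEFF
explicitly, which is the kind of step an application would have to
request itself.

```python
import unicodedata

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE (byte order mark)


def bom_survives_normalization(text):
    """Map each normalization form to whether U+FEFF is still present."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: BOM in unicodedata.normalize(form, text) for form in forms}


def identifiers_equal(a, b):
    """Compare two identifiers after removing every U+FEFF.

    Hypothetical helper; the removal is explicit, not a side effect
    of decoding or of normalization.
    """
    return a.replace(BOM, "") == b.replace(BOM, "")


sample = "foo" + BOM + "bar"
print(bom_survives_normalization(sample))
print(identifiers_equal(sample, "foobar"))
```

Since U+FEFF has no decomposition mapping, each of the four forms
should report it as still present, while the explicit comparison
treats the two identifiers as equal.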

Regards,
Martin