[I18n-sig] UTF-8 and BOM

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 17 May 2001 06:32:24 +0200


> Text data is different than binary data. Unicode text
> which uses combining characters (e.g. accent and 'e' to produce
> 'é') is equivalent to text which uses the combined character
> point directly. 

Are you saying that the BOM is removed under normalization? Which
normalization form?
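
One way to check this, sketched with Python's `unicodedata` module (an
assumption: the module's `normalize` function postdates this message, and
the helper name `survives_normalization` is made up for illustration).
U+FEFF has no decomposition mapping, so none of the four normalization
forms should remove it, even though the combining sequence next to it is
recomposed or decomposed:

```python
import unicodedata

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE / byte order mark


def survives_normalization(form: str) -> bool:
    """Return True if a leading U+FEFF is still present after normalizing."""
    text = BOM + "e\u0301"  # BOM, then 'e' + COMBINING ACUTE ACCENT
    return unicodedata.normalize(form, text).startswith(BOM)


# Check all four standard normalization forms.
results = {
    form: survives_normalization(form)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(results)
```

If the BOM has to go, it apparently needs to be stripped explicitly; it
does not disappear as a side effect of normalization.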

> You have to be careful here: UTF-16 prepends a BOM mark to
> every string pushed through the codec -- even small snippets.

That also seems like an error. When writing to a UTF-16 stream, I want
the BOM to appear only once, in the first bytes of the resulting file.
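
A minimal sketch of the difference, assuming a Python version that
provides `codecs.getincrementalencoder` (an API added after this message
was written; the variable names are illustrative only). Encoding each
snippet separately repeats the BOM, while an incremental encoder keeps
state and emits it only at the start:

```python
import codecs

pieces = ["a", "b"]

# One-shot encoding: every call to the "utf-16" codec emits its own BOM.
naive = b"".join(p.encode("utf-16") for p in pieces)

# Incremental encoding: the encoder remembers that the BOM was written.
encoder = codecs.getincrementalencoder("utf-16")()
streamed = b"".join(encoder.encode(p) for p in pieces)

bom = codecs.BOM_UTF16  # BOM in the platform's native byte order

print(naive.count(bom))     # number of BOMs in the concatenated snippets
print(streamed.count(bom))  # number of BOMs in the streamed output
```

An alternative under the same assumptions is to write the BOM once by
hand and encode the snippets with "utf-16-le" or "utf-16-be", which do
not add a BOM.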

Regards,
Martin