[I18n-sig] UTF-8 and BOM

Walter Doerwald walter@livinglogic.de
Mon, 21 May 2001 13:08:34 +0200

On 21.05.01 at 11:06 Toby Dickenson wrote:

> [...]
> >it is absurd to
> >expect code dealing with *strings* to handle BOMs.
> I agree with that, and it is a good reason why the codecs should always
> remove them.

??? This is a good reason why the codec should pass the \ufeff
through, because a \ufeff in a unicode object should not be 
considered to be a BOM but a ZWNBSP (it might e.g. be used to
give hints to a hyphenation or ligature algorithm.)
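
A minimal sketch of this point, assuming a current Python 3 interpreter
(which this message predates) and made-up sample bytes: when the byte
order is named explicitly there is no BOM to detect, so the codec hands
every U+FEFF through as an ordinary character.

```python
import unicodedata

# Sample big-endian UTF-16 data (hypothetical): U+FEFF, "a", U+FEFF, "b".
raw = b"\xfe\xff\x00a\xfe\xff\x00b"

# "utf-16-be" fixes the byte order up front, so nothing is treated as a BOM
# and both U+FEFF code points survive decoding.
text = raw.decode("utf-16-be")

print(ascii(text))
print(unicodedata.name(text[0]))
```

Once the data is a unicode object, code that works on it sees U+FEFF as
ZERO WIDTH NO-BREAK SPACE and never needs to know about byte order.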

> "M.-A. Lemburg" <mal@lemburg.com> wrote:
> >I'm still unsure whether I should change the UTF-16 decoder
> >to only remove the BOM at the start of the stream -- the above
> >case where BOMs are inserted due to string concatenation
> >is very common (each .write() to a file will produce such
> >a BOM mark).

Then the write function has an error. A BOM should only be
written at the start of the file and not on every call to .write().
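
As a rough illustration, assuming current Python 3 and its
codecs.getincrementalencoder() (not the 2001 stream writer discussed
here): encoding each chunk on its own repeats the BOM, while an
incremental encoder emits it only with the first chunk.

```python
import codecs

# One-shot encoding: every call emits its own BOM, so joining the
# results leaves a second BOM in the middle of the data.
joined = "ab".encode("utf-16") + "cd".encode("utf-16")

# Incremental encoding: the encoder keeps state between calls and
# emits the BOM only together with the first chunk.
encoder = codecs.getincrementalencoder("utf-16")()
first = encoder.encode("ab")
second = encoder.encode("cd", final=True)

# codecs.BOM_UTF16 is the BOM in the platform's native byte order,
# which is also the order the "utf-16" encoder uses.
print(joined.count(codecs.BOM_UTF16))
print(first.startswith(codecs.BOM_UTF16))
print(second.startswith(codecs.BOM_UTF16))
```

Decoding first + second with "utf-16" gives back "abcd" with no stray
U+FEFF in the middle.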

The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24) says:
   Q: I am using a protocol that has BOM at the start of text. 
      How do I represent an initial ZWNBSP?

   A: Use the sequence FEFF FEFF

But with the current decoder implementation *both* \ufeffs
will be removed, so the ZWNBSP disappears.
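
A small sketch of the behaviour the FAQ answer relies on, checked
against a current Python 3 "utf-16" codec as an assumption (the decoder
described in this message is the 2001 one and behaves differently): the
first FEFF is consumed as the BOM and the second one stays in the text
as ZWNBSP.

```python
# FEFF FEFF followed by "text", all in big-endian UTF-16 (sample data).
data = b"\xfe\xff" + b"\xfe\xff" + "text".encode("utf-16-be")

# The generic "utf-16" codec uses the leading FEFF to pick the byte
# order and drops it; the following FEFF is decoded as a character.
decoded = data.decode("utf-16")

print(ascii(decoded))
print(len(decoded))
```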

   Walter Dörwald

Walter Dörwald · LivingLogic AG · Bayreuth, Germany ·