[I18n-sig] UTF-8 and BOM

Walter Doerwald walter@livinglogic.de
Mon, 21 May 2001 13:08:34 +0200


On 21.05.01 at 11:06 Toby Dickenson wrote:

> [...]
> >it is absurd to
> >expect code dealing with *strings* to handle BOMs.
> 
> I agree with that, and it is a good reason why the codecs should always
> remove them.

??? On the contrary, this is a good reason why the codec should pass
the \ufeff through: a \ufeff in a Unicode object should not be
considered a BOM but a ZWNBSP (it might, e.g., be used to give
hints to a hyphenation or ligature algorithm).

> "M.-A. Lemburg" <mal@lemburg.com> wrote:
> 
> >I'm still unsure whether I should change the UTF-16 decoder
> >to only remove the BOM at the start of the stream -- the above
> >case where BOMs are inserted due to string concatenation
> >is very common (each .write() to a file will produce such
> >a BOM mark).

Then the write function has a bug. A BOM should only be
written at the start of the file, not on every call to
write().
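
A minimal sketch of what "BOM only at the start" could look like, in
present-day Python 3. The class name BomOnceWriter and the helper demo()
are hypothetical and not part of the codecs module; little-endian output
is an arbitrary assumption made for the example.

```python
import codecs
import io


class BomOnceWriter:
    """Hypothetical wrapper: emit a UTF-16 BOM before the first write only."""

    def __init__(self, stream):
        self.stream = stream
        self.bom_written = False

    def write(self, text):
        if not self.bom_written:
            # The BOM goes out exactly once, at the start of the stream.
            self.stream.write(codecs.BOM_UTF16_LE)
            self.bom_written = True
        # "utf-16-le" encodes without prepending a BOM of its own.
        self.stream.write(text.encode("utf-16-le"))


def demo():
    """Write two chunks and return the raw bytes that were produced."""
    buf = io.BytesIO()
    writer = BomOnceWriter(buf)
    writer.write("foo")
    writer.write("bar")
    return buf.getvalue()
```

With this arrangement the bytes produced by two write() calls are the
same as those produced by one call with the concatenated text, so no
BOM ends up in the middle of the file.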

The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24)
states:
   Q: I am using a protocol that has BOM at the start of text. 
      How do I represent an initial ZWNBSP?

   A: Use the sequence FEFF FEFF

But with the current decoder implementation *both* \ufeffs
will be removed, so the ZWNBSP disappears.
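
To illustrate the behaviour argued for here, the following sketch (again
present-day Python 3) strips only the first BOM and passes every later
\ufeff through. The function name decode_strip_leading_bom is
hypothetical, and falling back to big-endian when no BOM is present is
an assumption of the example, not a statement about what the real
decoder does.

```python
import codecs


def decode_strip_leading_bom(data):
    """Decode UTF-16 bytes, removing at most one BOM at the very start."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # The endian-specific codecs leave any further U+FEFF in place.
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # Assumption for this sketch: no BOM means big-endian.
    return data.decode("utf-16-be")


# The FAQ's "FEFF FEFF" case: a BOM followed by an initial ZWNBSP.
data = codecs.BOM_UTF16_BE + "\ufeffabc".encode("utf-16-be")
text = decode_strip_leading_bom(data)
```

Here the first FEFF is consumed as the BOM and the second one survives
as the first character of text, which is what the FAQ answer relies on.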



Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany ·
www.livinglogic.de