[I18n-sig] UTF-8 and BOM
Walter Doerwald
walter@livinglogic.de
Mon, 21 May 2001 13:08:34 +0200
On 21.05.01 at 11:06 Toby Dickenson wrote:
> [...]
> >it is absurd to
> >expect code dealing with *strings* to handle BOMs.
>
> I agree with that, and it is a good reason why the codecs should
> always remove them.
??? This is a good reason why the codec should pass the \ufeff
through, because a \ufeff in a unicode object should not be
considered to be a BOM but a ZWNBSP (it might e.g. be used to
give hints to a hyphenation or ligature algorithm.)
> "M.-A. Lemburg" <mal@lemburg.com> wrote:
>
> >I'm still unsure whether I should change the UTF-16 decoder
> >to only remove the BOM at the start of the stream -- the above
> >case where BOMs are inserted due to string concatenation
> >is very common (each .write() to a file will produce such
> >a BOM mark).
Then the write function has an error. A BOM should only be
written at the start of the file and not on every call to
write().
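To illustrate the point, here is a minimal sketch using the codecs module of a current Python 3 standard library, which is not the 2001 implementation under discussion. The helper names write_chunks_stateless and write_chunks_incremental are hypothetical. Encoding each chunk on its own emits a BOM per chunk, while an incremental encoder keeps state and emits the BOM only once:

```python
import codecs

def write_chunks_stateless(chunks):
    # Each call to str.encode("utf-16") starts from scratch,
    # so every chunk gets its own BOM.
    return b"".join(chunk.encode("utf-16") for chunk in chunks)

def write_chunks_incremental(chunks):
    # The incremental encoder remembers that the BOM was already
    # written and emits it only for the first chunk.
    encoder = codecs.getincrementalencoder("utf-16")()
    return b"".join(encoder.encode(chunk) for chunk in chunks)

chunks = ["ab", "cd"]
stateless = write_chunks_stateless(chunks)
incremental = write_chunks_incremental(chunks)

# A decoder that strips only the leading BOM leaves the second
# one in the text as U+FEFF.
print(repr(stateless.decode("utf-16")))
print(repr(incremental.decode("utf-16")))
```

With the stateless variant the extra U+FEFF ends up in the middle of the decoded text, which is the situation described above; the stateful variant avoids it at the source instead of relying on the decoder to clean up.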
The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24)
states:
   Q: I am using a protocol that has BOM at the start of text.
      How do I represent an initial ZWNBSP?
   A: Use the sequence FEFF FEFF
But with the current decoder implementation *both* \ufeffs
will be removed, so the ZWNBSP disappears.
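The difference can be sketched as follows. This is an illustration only: decode_strip_all is a hypothetical stand-in for the "remove every \ufeff" behaviour described above, not the actual decoder source, and decode_leading_only relies on the "utf-16" codec of a current Python 3, which is assumed to consume only the first BOM.

```python
import codecs

def decode_strip_all(data):
    # Hypothetical stand-in: decode as big-endian and then drop
    # every U+FEFF, wherever it occurs.
    return data.decode("utf-16-be").replace("\ufeff", "")

def decode_leading_only(data):
    # The "utf-16" codec uses the first FEFF to detect the byte
    # order and passes any later U+FEFF through unchanged.
    return data.decode("utf-16")

# FEFF FEFF followed by "x": a BOM, then an initial ZWNBSP.
data = codecs.BOM_UTF16_BE + "\ufeffx".encode("utf-16-be")

print(repr(decode_strip_all(data)))     # the ZWNBSP is lost
print(repr(decode_leading_only(data)))  # the ZWNBSP survives
```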
Bye,
Walter Dörwald
--
Walter Dörwald · LivingLogic AG · Bayreuth, Germany ·
www.livinglogic.de