[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Mon, 21 May 2001 13:45:46 +0200

Walter Doerwald wrote:
> On 21.05.01 at 11:06 Toby Dickenson wrote:
> > [...]
> > >it is absurd to
> > >expect code dealing with *strings* to handle BOMs.
> >
> > I agree with that, and it is a good reason why the codecs should
> > always remove them.
> ??? This is a good reason why the codec should pass the \ufeff
> through, because a \ufeff in a unicode object should not be
> considered to be a BOM but a ZWNBSP (it might e.g. be used to
> give hints to a hyphenation or ligature algorithm.)

> > "M.-A. Lemburg" <mal@lemburg.com> wrote:
> >
> > >I'm still unsure whether I should change the UTF-16 decoder
> > >to only remove the BOM at the start of the stream -- the above
> > >case where BOMs are inserted due to string concatenation
> > >is very common (each .write() to a file will produce such
> > >a BOM mark).
> Then the write function has an error. A BOM should only be
> written at the start of the file and not on every call to
> write().

That's hard to implement... how would the codec know where the
stream starts? It only interfaces with the underlying stream
through .read() and .write().
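
One way around this is to keep the "have we started yet?" state in
the writer object rather than in the stateless encode function. The
following is only a minimal sketch under that assumption: the class
name BOMOnceWriter is hypothetical and not part of any codec API
discussed in this thread, and little-endian output is chosen
arbitrarily.

```python
import codecs
import io


class BOMOnceWriter:
    """Hypothetical wrapper: emits a UTF-16 BOM only before the first chunk."""

    def __init__(self, stream):
        self.stream = stream
        self.started = False  # the state a stateless encode function lacks

    def write(self, text):
        if not self.started:
            # First call on this wrapper: write the byte order mark once.
            self.stream.write(codecs.BOM_UTF16_LE)
            self.started = True
        # The utf-16-le codec never adds a BOM of its own.
        self.stream.write(text.encode("utf-16-le"))


raw = io.BytesIO()
writer = BOMOnceWriter(raw)
writer.write("ab")
writer.write("cd")
data = raw.getvalue()  # one BOM followed by "abcd" in UTF-16-LE
```

Note that the wrapper only knows where its own output starts. It
cannot tell whether the underlying stream already contained data,
which is the limitation described above.
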
> The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24)
> states:
>    Q: I am using a protocol that has BOM at the start of text.
>       How do I represent an initial ZWNBSP?
>    A: Use the sequence FEFF FEFF
> But with the current decoder implementation *both* \ufeffs
> will be removed, so the ZWNBSP disappears.

Note that this only happens in the UTF-16 codec. All other
codecs pass BOMs through as-is. Perhaps I should modify
the UTF-16 codec to remove BOMs only when used in UTF-16
mode (without byte order indication), not in
UTF-16-LE/UTF-16-BE mode, and then only at the start of
a string?
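
To make the proposed split concrete, here is a small sketch of the
FAQ's FEFF FEFF case. It assumes a current Python 3 interpreter,
whose codecs may not behave like the 2001 implementation under
discussion; the variable names are illustrative only.

```python
import codecs

# A BOM, then a real ZWNBSP, then "a", all little-endian
# (the "FEFF FEFF" sequence from the Unicode FAQ).
data = codecs.BOM_UTF16_LE + "\ufeffa".encode("utf-16-le")

# Generic mode: byte order is read from the leading BOM, which is consumed.
generic = data.decode("utf-16")

# Explicit mode: byte order is given, so nothing is treated as a BOM.
explicit = data.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

In a current Python 3 the generic decode keeps exactly one \ufeff
(the ZWNBSP survives), while the explicit UTF-16-LE decode keeps
both.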

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/