[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Sat, 19 May 2001 12:16:55 +0200

Florian Weimer wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> > Why should a BOM behave any different than any other Unicode
> > character ? BOMs can be added and deleted in pretty much all
> > places of a Unicode text -- that's their intent after all, so
> > I don't see how they could break any property of an encoding.
> The BOM is overloaded with two meanings, it's certainly not a no-op
> character.

I didn't say that a BOM is a no-op character, just that adding
or removing a BOM character doesn't break the encoding.

For more information on BOMs and how they are intended to be used,
please see the Unicode FAQ:


The problem with BOMs is that they are supposed to appear at
the start of a string. However, if you concatenate two such
strings, the BOM in the middle will turn into a normal
ZWNBSP character. 
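
As a rough illustration of the concatenation problem, here is a minimal
sketch in present-day Python 3 syntax (an assumption: it postdates this
message, and the codecs under discussion may have behaved differently).
It encodes two strings separately with the "utf-16" codec, joins the
bytes, and decodes the result:

```python
import codecs

# Each independent encode with "utf-16" prepends its own BOM.
first = "abc".encode("utf-16")
second = "def".encode("utf-16")

# Plain byte concatenation leaves the second BOM in the middle.
joined_bytes = first + second

# The decoder consumes only the leading BOM; the interior one
# comes back as an ordinary U+FEFF character (ZWNBSP).
joined_text = joined_bytes.decode("utf-16")

starts_with_bom = first.startswith(codecs.BOM_UTF16)
interior_index = joined_text.find("\ufeff")
```

Here joined_text has seven characters rather than six, with U+FEFF
sitting between "abc" and "def".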

To be fully standards compliant, concatenation of UTF-16
strings (each of which starts with a BOM) would have to be
special-cased. This is not possible though, since
strings don't carry any encoding information.

The only way to properly deal with all this is at application
level, since only the programmer knows which string will
actually form the start of a file or a larger text string.

What I could do is add a UTF-8 codec which prepends a
BOM on encode and removes it from the stream during decode. The
programmer would then have to use this codec whenever she
wants UTF-8 files to start with a BOM.
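
A minimal sketch of what such a codec pair might look like, written in
present-day Python 3 syntax. The function names encode_utf8_with_bom and
decode_utf8_with_bom are hypothetical and only serve the illustration;
this is not the proposed implementation itself. Only a BOM at the very
start is removed on decode, so an interior U+FEFF survives:

```python
import codecs

def encode_utf8_with_bom(text):
    """Hypothetical helper: UTF-8 encode and prepend the UTF-8 BOM."""
    return codecs.BOM_UTF8 + text.encode("utf-8")

def decode_utf8_with_bom(data):
    """Hypothetical helper: drop one leading UTF-8 BOM, then decode."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

encoded = encode_utf8_with_bom("h\u00e9llo")
decoded = decode_utf8_with_bom(encoded)

# Data without a BOM decodes unchanged; an interior U+FEFF is kept.
plain = decode_utf8_with_bom("no bom".encode("utf-8"))
interior = decode_utf8_with_bom(encode_utf8_with_bom("a\ufeffb"))
```

Later Python versions ship a "utf-8-sig" codec that behaves along these
lines.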

I'm still unsure whether I should change the UTF-16 decoder
to only remove the BOM at the start of the stream -- the above
case where BOMs are inserted due to string concatenation
is very common (each .write() to a file will produce such
a BOM mark).
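
The trade-off can be sketched as follows, again in present-day Python 3
syntax. The helpers strip_leading_bom and strip_all_boms are
hypothetical names for the two decoder policies, and the loop only
simulates the assumption stated above, namely that every write encodes
its chunk with its own BOM:

```python
import io

BOM = "\ufeff"  # U+FEFF: byte order mark, or ZWNBSP when not at the start

def strip_leading_bom(text):
    """Policy A: drop U+FEFF only at the very start of the stream."""
    return text[1:] if text.startswith(BOM) else text

def strip_all_boms(text):
    """Policy B: drop every U+FEFF, wherever it occurs."""
    return text.replace(BOM, "")

# Simulate a file where each write() carries its own BOM.
stream = io.BytesIO()
for chunk in ["Hello, ", "world", "!"]:
    stream.write((BOM + chunk).encode("utf-16-le"))

# "utf-16-le" does no BOM handling, so every U+FEFF is still present.
raw_text = stream.getvalue().decode("utf-16-le")

policy_a = strip_leading_bom(raw_text)  # interior BOMs remain as ZWNBSP
policy_b = strip_all_boms(raw_text)     # clean text, but any intended
                                        # ZWNBSP would be lost as well
```

Policy A leaves two stray U+FEFF characters in the text; policy B
removes them, at the cost of also deleting any ZWNBSP the author put
there on purpose.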

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/