[I18n-sig] UTF-8 and BOM

Paul Prescod paulp@ActiveState.com
Wed, 16 May 2001 10:32:43 -0700


Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives
users an option.

Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
character. The UTF-16 decoder removes it. I recognize that the BOM is
not useful as a "byte order mark" for UTF-8 data but I would still
suggest that the UTF-8 decoder should remove it for these reasons:

 1) Microsoft has taken the stance that a BOM is legal on UTF-8 data

 2) Doing so is legal:

"Q: Is the UTF-8 encoding scheme the same irrespective of whether the
underlying processor is little endian or big endian?

A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
endian problem as there is for encoding forms that use 16-bit or 32-bit
code units. Where a BOM is used with UTF-8, it is only to distinguish
UTF-8 from other UTF encodings -- it has nothing to do with byte order.
[KW]"

http://www.unicode.org/unicode/faq/utf_bom.html

 3) I think that distinguishing UTF-8 from other encodings through the
BOM is actually a great idea and I wish that every UTF-8 creator would
do it!

 4) The behavior would be consistent with the UTF-16 behavior.
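
To make point 3 concrete, here is a rough sketch of how a reader could
use a leading BOM to guess which UTF encoding a byte string is in. It
assumes a current Python 3 (not the 2.1 discussed in this message), and
sniff_bom and _BOMS are made-up names for illustration, not an existing
API. Data with no BOM simply yields None.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper: it only inspects the first few bytes and never
    validates the rest of the data.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("\ufeffabcd".encode("utf-8")))   # BOM-prefixed UTF-8
print(sniff_bom(b"abcd"))                        # no BOM at all
```

One known ambiguity in this approach: UTF-16 LE data whose first
character after the BOM is U+0000 looks the same as a UTF-32 LE BOM.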

----
import codecs

with_bom = u"\uFEFFabcd"
utf_8 = with_bom.encode("utf-8")
utf_16 = with_bom.encode("utf-16")

print repr(codecs.utf_8_decode(utf_8))
(u'\ufeffabcd', 7)

print repr(codecs.utf_16_decode(utf_16))
(u'abcd', 12)
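
For comparison, the behavior suggested above could be approximated in
user code along these lines. This is only a sketch, written for a
current Python 3 rather than 2.1; decode_utf8_strip_bom is a
hypothetical helper name, and it removes at most one leading BOM.

```python
import codecs


def decode_utf8_strip_bom(data):
    """Decode UTF-8 bytes, dropping one leading BOM (EF BB BF) if present.

    Hypothetical helper for illustration; it is not part of the codecs
    module.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_bom = "\ufeffabcd".encode("utf-8")
print(repr(decode_utf8_strip_bom(with_bom)))   # BOM removed
print(repr(decode_utf8_strip_bom(b"abcd")))    # unchanged when no BOM
```

Only the first BOM is stripped, so a U+FEFF that appears later in the
data still comes through as a real character. Current Python versions
also ship a "utf-8-sig" codec that skips a leading BOM on decode.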


-- 
Take a recipe. Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook