[I18n-sig] UTF-8 and BOM
Paul Prescod
paulp@ActiveState.com
Wed, 16 May 2001 10:32:43 -0700
Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives
users an option.
Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
character. The UTF-16 decoder removes it. I recognize that the BOM is
not useful as a "byte order mark" for UTF-8 data but I would still
suggest that the UTF-8 decoder should remove it for these reasons:
1) Microsoft has taken the stance that a BOM is legal on UTF-8 data
2) Putting a BOM on UTF-8 data is legal according to the Unicode FAQ:
"Q: Is the UTF-8 encoding scheme the same irrespective of whether the
underlying processor is little endian or big endian?
A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
endian problem as there is for encoding forms that use 16-bit or 32-bit
code units. Where a BOM is used with UTF-8, it is only to distinguish
UTF-8 from other UTF encodings -- it has nothing to do with byte order.
[KW]"
http://www.unicode.org/unicode/faq/utf_bom.html
3) I think that distinguishing UTF-8 from other encodings through the
BOM is actually a great idea and I wish that every UTF-8 creator would
do it!
4) The behavior would be consistent with the UTF-16 behavior.
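To make point 3 concrete, here is a rough sketch of how a reader could use a leading BOM to tell the UTF encodings apart. It is written for a current Python 3 interpreter rather than 2.1, and `sniff_bom` is a hypothetical helper name, not an existing library function. Only the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# Hypothetical lookup table: BOM bytes -> encoding name.
# The UTF-32 BOMs are listed first because the UTF-32 LE BOM
# (FF FE 00 00) starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(b"\xef\xbb\xbfabcd"))
print(sniff_bom(b"abcd"))
```

One known limitation of this sketch: UTF-16 LE data whose first character after the BOM is U+0000 would be reported as UTF-32 LE.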
----
import codecs
with_bom = u"\uFEFFabcd"
utf_8 = with_bom.encode("utf-8")
utf_16 = with_bom.encode("utf-16")
print repr(codecs.utf_8_decode(utf_8))
(u'\ufeffabcd', 7)
print repr(codecs.utf_16_decode(utf_16))
(u'abcd', 12)
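For illustration, the behavior I am suggesting for the UTF-8 decoder could look roughly like the wrapper below. This is only a sketch in current Python 3 syntax, not a patch to the codec itself, and `decode_utf8_strip_bom` is a made-up name. It drops at most one leading BOM and leaves any later U+FEFF characters alone.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def decode_utf8_strip_bom(data):
    """Decode UTF-8 bytes, dropping a single leading BOM if present."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data.decode("utf-8")


with_bom = "\ufeffabcd".encode("utf-8")

# Plain decoding keeps the BOM as a leading character (5 characters).
print(len(with_bom.decode("utf-8")))

# The wrapper returns only the 4 characters of real text.
print(len(decode_utf8_strip_bom(with_bom)))
```

Data without a BOM passes through unchanged, so callers would not need to know in advance whether the file came from Notepad.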
-- 
Take a recipe. Leave a recipe.
Python Cookbook! http://www.ActiveState.com/pythoncookbook