Unicode BOM marks
francis.girard at free.fr
Mon Mar 7 20:24:42 CET 2005
For the first time in my programming life, I have to take care of character
encoding. I have a question about BOM marks.
If I understand correctly, in the UTF-8 binary representation of Unicode text,
some systems (Windows?) add a BOM mark at the beginning of the file and some
(Linux?) don't. Therefore, the exact same text encoded in the same UTF-8 will
result in two different binary files of slightly different lengths.
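A minimal sketch of that difference, assuming a Python recent enough to ship the `utf-8-sig` codec (it is newer than this post, and the snippet uses Python 3 syntax). The sample string is only an illustration:

```python
text = "hello"

# Plain UTF-8: the codec writes no BOM.
without_bom = text.encode("utf-8")

# "utf-8-sig" is the UTF-8 variant that prepends the 3-byte signature EF BB BF.
with_bom = text.encode("utf-8-sig")

print(without_bom)                       # b'hello'
print(with_bom)                          # b'\xef\xbb\xbfhello'
print(len(with_bom) - len(without_bom))  # 3
```

So the two files would differ only by the three leading bytes.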
I guess that this leading BOM mark consists of special marker bytes that can
in no way be decoded as valid text.
(I really, really hope the answer is yes; otherwise we're in hell when moving
files from one platform to another, even with the same Unicode encoding.)
I also guess that this leading BOM mark is silently ignored by any
Unicode-aware file stream reader that has already been told the file follows
the UTF-8 encoding standard.
If so, is that the case with the Python codecs decoder?
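One way to check this is to decode the same bytes with both codecs. This is only a sketch, again assuming Python 3 and the `utf-8-sig` codec. In current Python the plain `utf-8` decoder appears to keep the BOM as the character U+FEFF, and only `utf-8-sig` drops it:

```python
import codecs

# Bytes as a BOM-writing editor might save them.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec decodes the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec skips a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

print(plain[0] == "\ufeff")  # True
print(len(plain))            # 6
print(stripped)              # hello
```

Since `utf-8-sig` also accepts input without a BOM, it looks like the safer choice when the origin of a file is unknown.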
In the Python documentation, I see these constants. The documentation is not
clear about which encoding each constant applies to. Here's my understanding:
BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only
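The constants can also be compared directly in the interpreter. A small sketch, assuming Python 3: there the three short names turn out to be aliases of the UTF-16 constants, and the names without a byte-order suffix follow the machine's native byte order:

```python
import codecs
import sys

# The short names are aliases of the UTF-16 constants.
print(codecs.BOM == codecs.BOM_UTF16)        # True
print(codecs.BOM_BE == codecs.BOM_UTF16_BE)  # True
print(codecs.BOM_LE == codecs.BOM_UTF16_LE)  # True

# BOM_UTF16 and BOM_UTF32 depend on the native byte order of the machine.
little = sys.byteorder == "little"
native_utf16 = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
native_utf32 = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE
print(codecs.BOM_UTF16 == native_utf16)      # True
print(codecs.BOM_UTF32 == native_utf32)      # True

# The raw byte values of the fixed-order constants.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'
```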
Why should I need these constants if the codecs decoder can handle them
without my help, given only the encoding?
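One plausible use is the case where the encoding is not known in advance and has to be guessed from the first bytes. The following is a sketch only, in Python 3: `guess_encoding` and its `default` parameter are hypothetical names made up for illustration, not part of the `codecs` module.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, default="utf-8"):
    """Return a codec name based on a leading BOM (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM found: fall back to the caller's assumption.
    return default


sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding = guess_encoding(sample)
print(encoding)                 # utf-16
print(sample.decode(encoding))  # hi
```

The returned codec names (`utf-16`, `utf-32`, `utf-8-sig`) are the ones that read the BOM and strip it while decoding, so the caller does not have to slice it off by hand.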
Python tells me to use an encoding declaration at the top of my files (the
message refers to http://www.python.org/peps/pep-0263.html).
I expected to see there a list of acceptable