[Python-3000] Pre-PEP: Easy Text File Decoding
jason.orendorff at gmail.com
Wed Sep 13 20:23:33 CEST 2006
On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.
John, you're mistaken about the reason this BOM is here.
In Notepad at least, the BOM is intentionally generated when writing
the file. It's not a "mistake" or "laziness". It's metadata. (I
admit the BOM was not originally invented for this purpose.)
> There is no MEANINGFUL definition of BOM in a UTF-8
This thread is about files, not strings. At the start of a file, a
UTF-8 BOM is meaningful. It means the file is UTF-8.
On Windows, there's a system default encoding, and it's never UTF-8.
Notepad writes the BOM so that later, when you open the file in
Notepad again, it can identify the file as UTF-8.
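The kind of sniffing Notepad does is easy to sketch. Here is a minimal illustration in modern Python (str is unicode, so no u'' prefix); `sniff_bom` is a hypothetical helper, not an existing API, and it only covers the standard BOM constants from the codecs module:

```python
import codecs

def sniff_bom(data):
    # Hypothetical helper: guess an encoding from a leading BOM in a
    # byte string, returning None when no BOM is present.
    # UTF-32-LE must be checked before UTF-16-LE, because its BOM
    # (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    for bom, name in [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b'hello'))  # utf-8-sig
print(sniff_bom(b'plain ascii'))              # None
```

A real reader would fall back to the system default encoding when `sniff_bom` returns None, which is exactly the behavior being described here.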
> You can see the logical fallacy if you imagine emitting UTF-16
> text in an environment of one byte sex, reducing that text to
> UTF-8, carrying it to an environment of the other byte sex and
> raising it back to UTF-16.
It sounds as if you think this will corrupt the BOM, but it works fine:
>>> import codecs
>>> # "Emitting UTF-16 text" in a little-endian environment
>>> s1 = codecs.BOM_UTF16_LE + u'hello world'.encode('utf-16-le')
>>> # "Reducing that text to UTF-8"
>>> s2 = s1.decode('utf-16-le').encode('utf-8')
>>> # "Raising it back to UTF-16" in a big-endian environment
>>> s3 = s2.decode('utf-8').encode('utf-16-be')
>>> s3[:2] == codecs.BOM_UTF16_BE
True
The BOM is still correct: the data is UTF-16-BE, and the BOM agrees.
A UTF-8 string or file will contain exactly the same bytes (including
the BOM, if any) whether it is generated from UTF-16-BE or -LE. All
three are lossless representations in bytes of the same abstract
ideal, which is a sequence of Unicode codepoints.
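That claim can be checked directly. A small sketch in modern Python (str is unicode, so no u'' prefix), starting from the same text serialized with each byte order:

```python
import codecs

text = 'hello world'
# The same text serialized as UTF-16 in each byte order, BOM included.
le = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
be = codecs.BOM_UTF16_BE + text.encode('utf-16-be')

# Decoding with the explicit-endian codecs keeps the BOM as the
# codepoint U+FEFF, so it survives into the UTF-8 output.
utf8_from_le = le.decode('utf-16-le').encode('utf-8')
utf8_from_be = be.decode('utf-16-be').encode('utf-8')

print(utf8_from_le == utf8_from_be)                  # True
print(utf8_from_le.startswith(codecs.BOM_UTF8))      # True
```

Both byte orders decode to the identical sequence of codepoints (U+FEFF followed by the text), so the UTF-8 bytes they produce are identical too.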