[Python-3000] Pre-PEP: Easy Text File Decoding
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Wed Sep 13 15:37:05 CEST 2006
"John S. Yates, Jr." <john at yates-sheets.org> writes:
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8. There is no MEANINGFUL definition
> of BOM in a UTF-8 string. But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.
The Unicode standard is at fault too.
It specifies UTF-16 and UTF-32 in two kinds of variants:
- UTF-{16,32} with an optional BOM (defaulting to big endian if the
  BOM is not present), where the BOM is mandatory if the first
  character of the contents is U+FEFF (otherwise that character would
  be mistaken for a BOM).
- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM;
  a leading U+FEFF in, say, UTF-16BE must not be interpreted as a BOM;
  it is always part of the text.
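The difference between the two kinds of variants can be sketched with Python's standard codecs. This is only an illustration: it assumes the "utf-16" codec plays the role of the BOM-aware variant and "utf-16-be" the fixed-endianness one, and the variable names are made up for the example. The no-BOM case is left out because a codec's default byte order there may not match the big-endian default of the standard.

```python
BOM_BE = b"\xfe\xff"  # U+FEFF serialized in big-endian order

# Big-endian bytes for "A", preceded by a BOM.
data = BOM_BE + "A".encode("utf-16-be")

# BOM-aware variant: the leading U+FEFF selects the byte order and is dropped.
with_bom_variant = data.decode("utf-16")

# Fixed-endianness variant: the same bytes keep U+FEFF as part of the text.
fixed_variant = data.decode("utf-16-be")

# Text that really starts with U+FEFF needs an explicit BOM in front of it
# under the BOM-aware variant, or its first character would be consumed.
text = "\ufeffA"
protected = BOM_BE + text.encode("utf-16-be")
round_tripped = protected.decode("utf-16")

print(repr(with_bom_variant), repr(fixed_variant), repr(round_tripped))
```

The same byte string decodes to two different texts depending on which variant the reader assumes, which is why the variant has to be known out of band.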
The problem is that the situation is not clear in the case of UTF-8.
Formally it doesn't have a BOM, but the standard includes some
ambiguous wording: it notes that various software uses a UTF-8 BOM
and says that the presence of a BOM should not affect the
interpretation. The standard should clearly distinguish two
interpretations of UTF-8: one without the concept of a BOM, and one
which permits the BOM (and in fact makes it mandatory if the stream
begins with U+FEFF).
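As a rough sketch of what two clearly separated interpretations could look like, Python's "utf-8" and "utf-8-sig" codecs behave roughly this way. The mapping of these codecs onto the two interpretations is an assumption made for illustration, not something the standard defines, and the function names are hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def decode_no_bom_concept(data: bytes) -> str:
    """Interpretation 1: no BOM exists; a leading U+FEFF is ordinary text."""
    return data.decode("utf-8")


def decode_bom_permitted(data: bytes) -> str:
    """Interpretation 2: one leading BOM, if present, is stripped."""
    return data.decode("utf-8-sig")


def encode_bom_permitted(text: str) -> bytes:
    """Writer for interpretation 2: always emits a BOM, which also covers
    the case where the text itself begins with U+FEFF."""
    return text.encode("utf-8-sig")


stream = UTF8_BOM + b"A"
print(repr(decode_no_bom_concept(stream)))  # U+FEFF kept as text
print(repr(decode_bom_permitted(stream)))   # U+FEFF treated as a BOM
```

Under the second interpretation a text starting with U+FEFF survives a round trip only because the writer adds its own BOM in front; a reader using the first interpretation on the same bytes would see two U+FEFF characters.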
--
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/