[Python-3000] Pre-PEP: Easy Text File Decoding

Wed Sep 13 15:37:05 CEST 2006

"John S. Yates, Jr." <john at yates-sheets.org> writes:

> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.

The Unicode standard is at fault too.

It specifies UTF-16 and UTF-32 in two kinds of variants:

- UTF-{16,32} with an optional BOM (defaulting to big endian if the
  BOM is not present), where the BOM is mandatory if the first
  character of the contents is U+FEFF (otherwise it would be mistaken
  for a BOM).

- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM;
  a leading U+FEFF in, say, UTF-16BE must not be interpreted as a BOM:
  it is always part of the text.
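
As a rough illustration of the two kinds of variants, here is a small
sketch using Python's standard "utf-16" and "utf-16-be" codecs. The
sample bytes are made up for the example, and the string literals
assume present-day Python 3 syntax.

```python
# Hypothetical sample: U+FEFF followed by "A", serialized big-endian.
data = b"\xfe\xff\x00A"

# "utf-16" has the concept of a BOM: a leading U+FEFF is consumed
# and only selects the byte order.
with_bom_concept = data.decode("utf-16")

# "utf-16-be" has a fixed endianness and no BOM: the same leading
# U+FEFF is kept as part of the text.
fixed_endianness = data.decode("utf-16-be")

# Text that really begins with U+FEFF survives a "utf-16" round trip
# only because the encoder writes a BOM in front of it.
text = "\ufeffA"
encoded = text.encode("utf-16")
round_trip = encoded.decode("utf-16")
```

The same bytes give different text depending on which variant the
reader assumes, so the choice of variant has to be known out of band.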

The problem is that the situation is not clear in the case of UTF-8.
Formally it doesn't have a BOM, but the standard includes some
ambiguous wording to the effect that various software uses a UTF-8 BOM
and that the presence of a BOM should not affect the interpretation.
The standard should clearly distinguish two interpretations of UTF-8:
one without the concept of a BOM, and one which permits the BOM (and
in fact makes it mandatory if the contents begin with U+FEFF).
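
The two interpretations can be sketched with Python's "utf-8" and
"utf-8-sig" codecs, which happen to behave like the BOM-less and the
BOM-permitting reading respectively. This only shows what the
distinction looks like in one implementation, not what the standard
mandates; the sample bytes are again made up.

```python
# Hypothetical sample: the UTF-8 encoding of U+FEFF followed by "A".
data = b"\xef\xbb\xbfA"

# Interpretation without a BOM concept: U+FEFF is part of the text.
plain = data.decode("utf-8")

# Interpretation that permits a BOM: the leading U+FEFF is stripped.
sig = data.decode("utf-8-sig")

# Under the BOM-permitting interpretation, text that really begins
# with U+FEFF must be written with a BOM in front of it, so the
# three-byte sequence EF BB BF appears twice.
text = "\ufeffA"
encoded = text.encode("utf-8-sig")
round_trip = encoded.decode("utf-8-sig")
```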

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/