[Python-3000] Pre-PEP: Easy Text File Decoding
walter at livinglogic.de
Thu Sep 14 00:05:31 CEST 2006
Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>> It is a mistake on Microsoft's part to fail to strip the BOM
>> during conversion to UTF-8.
> John, you're mistaken about the reason this BOM is here.
> In Notepad at least, the BOM is intentionally generated when writing
> the file. It's not a "mistake" or "laziness". It's metadata. (I
> admit the BOM was not originally invented for this purpose.)
In theory it's only metadata if external information says that it is. In
practice it's unlikely that a charmap-encoded file begins with these
three bytes. Nevertheless, it's only a hint.
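A minimal sketch of why the three bytes are only a hint, assuming Python's
standard codecs module (the helper name looks_like_utf8_sig is made up for
illustration): the check itself is trivial, but the same bytes are also
perfectly valid text in a charmap encoding such as Latin-1.

```python
import codecs


def looks_like_utf8_sig(data: bytes) -> bool:
    """Return True if data starts with the three bytes EF BB BF.

    This is a heuristic only: it cannot prove the data is UTF-8.
    """
    return data.startswith(codecs.BOM_UTF8)


# The same three bytes decode without error as three ordinary
# characters in Latin-1, so a charmap file could legally start with them.
signature_as_latin1 = codecs.BOM_UTF8.decode("latin-1")
```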
>> There is no MEANINGFUL definition of BOM in a UTF-8
> This thread is about files, not strings. At the start of a file, a
> UTF-8 BOM is meaningful. It means the file is UTF-8.
... and the first "character" in the file is U+FEFF. If you want the
codec to drop the BOM on reading, use the UTF-8-Sig codec.
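A short sketch of the difference, assuming the codec names "utf-8" and
"utf-8-sig" as spelled in Python's codec registry (the sample data is made
up for illustration):

```python
import codecs

# Sample bytes as an editor like Notepad might write them:
# the UTF-8 signature followed by the encoded text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain UTF-8 codec keeps U+FEFF as the first character.
with_bom = data.decode("utf-8")

# The utf-8-sig codec drops a leading signature if one is present.
without_bom = data.decode("utf-8-sig")

# utf-8-sig also accepts input that has no signature at all.
plain = "hello".encode("utf-8").decode("utf-8-sig")
```

When reading from a file instead of a bytes object, the same codec name can
be passed as the encoding, e.g. codecs.open(filename, "r", "utf-8-sig").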