[Python-3000] Pre-PEP: Easy Text File Decoding
David Hopwood
david.nospam.hopwood at blueyonder.co.uk
Thu Sep 14 01:36:50 CEST 2006
Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>
>>It is a mistake on Microsoft's part to fail to strip the BOM
>>during conversion to UTF-8.
>
> John, you're mistaken about the reason this BOM is here.
>
> In Notepad at least, the BOM is intentionally generated when writing
> the file. It's not a "mistake" or "laziness". It's metadata. (I
> admit the BOM was not originally invented for this purpose.)
>
>>There is no MEANINGFUL definition of BOM in a UTF-8
>>string.
>
> This thread is about files, not strings. At the start of a file, a
> UTF-8 BOM is meaningful. It means the file is UTF-8.
>
> On Windows, there's a system default encoding, and it's never UTF-8.

The Windows system encoding can be UTF-8, but only for some locales
recently added in Windows 2000/XP, where there was no compatibility
constraint to use a non-Unicode encoding.

You're correct about the use of a BOM as a signature. All Unicode-conformant
applications should accept this use of a BOM in UTF-8 (although they need
not generate it); the standard is quite clear on that.
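
As an illustration of accepting the BOM as a signature, here is a minimal
sketch in Python. The helper name decode_with_utf8_signature and the cp1252
fallback are assumptions made for the example, not part of any proposal in
this thread; only codecs.BOM_UTF8 and the standard codec names are real.

```python
import codecs


def decode_with_utf8_signature(data: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes, treating a leading UTF-8 BOM as an encoding signature.

    Hypothetical helper for illustration only. The fallback encoding is an
    assumption standing in for "the system default encoding".
    """
    if data.startswith(codecs.BOM_UTF8):
        # Signature present: the content is UTF-8, so drop the 3-byte BOM
        # and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: fall back to the assumed legacy encoding.
    return data.decode(fallback)


# With a signature, the BOM is consumed and does not appear in the text.
with_bom = decode_with_utf8_signature(b"\xef\xbb\xbfhi")

# Without a signature, the bytes are read in the fallback encoding.
without_bom = decode_with_utf8_signature(b"caf\xe9")
```

Python's standard "utf-8-sig" codec covers the UTF-8 half of this: it skips
a leading BOM when decoding if one is present, and writes one when encoding.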
--
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>