[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Fri Jan 8 08:55:27 CET 2010

On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote:

> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
>> 
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>> 
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>>> talk. And for the other two, perhaps it would make more sense to have
>>> a separate encoding-guessing function that takes a binary stream and
>>> returns a text stream wrapping it with the proper encoding?
>> 
>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>> description of the issues:
>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that.  That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
> 
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.

I'm saying that the BOM itself isn't enough to detect that the file is actually UTF-8.  If (for whatever reason: explicitly specified, guessed in some other way) the file's encoding is determined to be something else, the bytes comprising the BOM should be decoded as normal.  It's just that the UTF-8 decoding of the BOM at the start of a file should be "".

> (And yes, I know this happens. Doesn't mean we need to auto-guess by
> default; there are lots of issues e.g. what should happen after
> seeking to offset 0?)

I think it's pretty clear that the BOM should still be skipped in that case ...