[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Fri Jan 8 17:47:18 CET 2010

Victor Stinner wrote:
> On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
> (...)
>> (And yes, I know this happens. Doesn't mean we need to auto-guess by
>> default; there are lots of issues e.g. what should happen after
>> seeking to offset 0?)
> 
> I wrote a new version of my patch (version 3):
> 
>  * don't change the default behaviour: use open(filename, encoding="BOM") to 
> check the BOM if there is any
>  * fix for seek(0): always ignore the BOM
>  * add a unit test: check that the right encoding is detected, but also that the 
> BOM is ignored (especially after a seek(0))
> 
> BOM encoding doesn't work for writing into a file, so open(filename, "w", 
> encoding="BOM") raises a ValueError.
> 
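For comparison, the existing utf-8-sig codec already skips the BOM again
after seek(0) when reading. A minimal sketch, not the patch itself; the
temporary file and its contents are made up for illustration:

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file that starts with a BOM (signature).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as raw:
    raw.write(codecs.BOM_UTF8 + b"hello\n")

with open(path, "r", encoding="utf-8-sig") as f:
    first = f.read()    # the BOM is consumed by the codec, not returned
    f.seek(0)           # rewind to the very start of the file
    second = f.read()   # the decoder is reset, so the BOM is skipped again

os.remove(path)
```

Both reads return the same text, with no leading U+FEFF character.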
I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only).

The preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because there are multiple
bytes per code unit, so the bytes could be big-endian or little-endian.

The BOM (or signature) tells you what the encoding is, defaulting to
UTF-8 if there's none. If decoding subsequently raises a
UnicodeDecodeError, then so be it!

Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.
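The detection rule and the "which encoding did it pick" idea can be
sketched together. This is only an illustration under assumptions, not
the patch: detect_encoding() and open_bom() are hypothetical helpers, and
the file names are made up. It relies on the existing "encoding"
attribute of text files to report the choice.

```python
import codecs
import os
import shutil
import tempfile

# Longer signatures first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def detect_encoding(head, default="utf-8"):
    """Hypothetical helper: map the first bytes of a file to a codec name."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default  # no BOM found: fall back to UTF-8

def open_bom(path, default="utf-8"):
    """Hypothetical helper: open path for reading with the detected codec."""
    with open(path, "rb") as raw:
        encoding = detect_encoding(raw.read(4), default)
    return open(path, "r", encoding=encoding)

# Usage: read a file, then write a copy in the encoding that was detected.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "src.txt")
dst_path = os.path.join(tmpdir, "dst.txt")
with open(src_path, "w", encoding="utf-16") as f:
    f.write("spam\n")

with open_bom(src_path) as src:
    chosen = src.encoding   # the file object reports the chosen codec
    text = src.read()
with open(dst_path, "w", encoding=chosen) as dst:
    dst.write(text)

with open(dst_path, "rb") as raw:
    dst_head = raw.read(2)  # first bytes of the copy, for inspection
shutil.rmtree(tmpdir)
```

Two caveats with this sketch: the "utf-16" and "utf-32" codecs write the
BOM in the machine's native byte order, so the copy may not keep the
original byte order, and UTF-16-LE text that begins with U+0000 is
indistinguishable from a UTF-32-LE BOM.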