[Python-Dev] Improve open() to support reading file starting with an unicode BOM
MRAB
python at mrabarnett.plus.com
Fri Jan 8 17:47:18 CET 2010
Victor Stinner wrote:
> On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
> (...)
>> (And yes, I know this happens. Doesn't mean we need to auto-guess by
>> default; there are lots of issues e.g. what should happen after
>> seeking to offset 0?)
>
> I wrote a new version of my patch (version 3):
>
> * don't change the default behaviour: use open(filename, encoding="BOM") to
> check for a BOM, if there is any
> * fix for seek(0): always ignore the BOM
> * add a unit test: check that the right encoding is detected, and also that the
> BOM is ignored (especially after a seek(0))
>
> BOM encoding doesn't work for writing into a file, so open(filename, "w",
> encoding="BOM") raises a ValueError.
>
I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only).
The preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because each code unit spans
multiple bytes, so the bytes could be big-endian or little-endian.
The BOM (or signature) tells you what the encoding is, defaulting to
UTF-8 if there's none. If decoding subsequently raises a
UnicodeDecodeError, then so be it!
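
To make the rule above concrete, here is a rough sketch of the detection
step on raw bytes. It is not Victor's patch; detect_bom and
decode_with_bom are hypothetical names used only for illustration, and
the only library facts relied on are the BOM constants in the codecs
module. Note that a UTF-16-LE file whose first character is U+0000 would
be mistaken for UTF-32-LE, which is an inherent ambiguity of the
signatures.

```python
import codecs

# Longer signatures come first: the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data, default="utf-8"):
    """Return (encoding, bom_length) for the leading bytes of a file.

    Hypothetical helper: falls back to `default` when no BOM is found.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_with_bom(data):
    """Decode bytes using the BOM if present, else UTF-8.

    The BOM itself is skipped. A UnicodeDecodeError is allowed to
    propagate if the guess turns out to be wrong.
    """
    encoding, skip = detect_bom(data)
    return data[skip:].decode(encoding)
```

With this, decode_with_bom(b"\xef\xbb\xbfabc") and decode_with_bom(b"abc")
both give "abc", while bytes that are neither BOM-prefixed nor valid
UTF-8 fail with UnicodeDecodeError.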
Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.
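
As a sketch of what that could look like from user code today, assuming
nothing about the eventual API: sniff_encoding and copy_text below are
hypothetical helpers. They return the name of a standard codec that both
strips the BOM on reading and emits one on writing, so the same name can
be reused for the output file. One caveat: the "utf-16" and "utf-32"
codecs write in the machine's native byte order, so the endianness of
the copy may differ from the original.

```python
import codecs


def sniff_encoding(path, default="utf-8"):
    """Guess a codec name from the file's BOM (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its LE BOM starts with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return default


def copy_text(src, dst):
    """Rewrite src to dst in the encoding detected for src."""
    encoding = sniff_encoding(src)
    # newline="" keeps line endings untouched in both directions.
    with open(src, encoding=encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=encoding, newline="") as fout:
        fout.write(text)
    return encoding
```

The file object returned by open() also reports the codec it was opened
with through its encoding attribute, so a built-in BOM mode could expose
its decision the same way.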