[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Fri Jan 8 07:12:12 CET 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
>>
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>
>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
>> <victor.stinner at haypocalc.com> wrote:
>>
>> Hi,
>>
>> The built-in open() function is unable to open a UTF-16/32 file starting
>> with a BOM if the encoding is not specified (it raises a Unicode error).
>> For a UTF-8 file starting with a BOM, read()/readline() also return the
>> BOM, whereas the BOM should be "ignored".
>>
>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>> talk. And for the other two, perhaps it would make more sense to have
>> a separate encoding-guessing function that takes a binary stream and
>> returns a text stream wrapping it with the proper encoding?
>>
>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>> description of the issues:
>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that.  That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
> 
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.
> 
> (And yes, I know this happens. Doesn't mean we need to auto-guess by
> default; there are lots of issues e.g. what should happen after
> seeking to offset 0?)

The BOM should not be seekable if the file is opened with the proposed
"guess encoding from BOM" mode: it isn't properly part of the stream at
all in that case.
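
For illustration, here is a minimal sketch of that behaviour using the
existing "utf-8-sig" codec, assuming a current Python 3. The helper names
read_twice and make_bom_file are made up for this example. The decoder
consumes the BOM, so seeking back to offset 0 and reading again yields the
same text, without the BOM:

```python
import codecs
import os
import tempfile


def make_bom_file(text):
    """Write `text` as UTF-8 preceded by a BOM; return the file's path.

    Hypothetical helper, only here so the example is self-contained.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(codecs.BOM_UTF8 + text.encode("utf-8"))
    return path


def read_twice(path, encoding):
    """Read the whole file, seek back to offset 0, and read it again."""
    with open(path, "r", encoding=encoding) as f:
        first = f.read()
        f.seek(0)
        second = f.read()
    return first, second
```

With encoding="utf-8-sig" both reads return the text alone; with plain
encoding="utf-8" both reads start with U+FEFF, which is the behaviour
Victor is complaining about.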

A UTF-8 BOM is an absurdity, but it exists *everywhere* in the wild:
Python would do well to make it as easy as possible to consume such
files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs).  In
the best of all possible worlds, I would just try opening the file so:

  f = open('/path/to/file', 'r', encoding="DWIFM")

and any BOM present would set the encoding for the remainder of the stream.
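
No such encoding name exists today, so the following is only a rough
sketch of how that could be approximated in user code, assuming Python 3.
The names guess_encoding_from_bom and open_guessing_bom and the "utf-8"
fallback are my own assumptions, not an existing API. It sniffs the first
four bytes and picks one of the stock codecs ("utf-8-sig", "utf-16",
"utf-32") that already consume a leading BOM:

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones:
# the UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head, default="utf-8"):
    """Return a codec name for the BOM at the start of `head` (bytes).

    Falls back to `default` (an assumption of this sketch) if no BOM is found.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def open_guessing_bom(path, default="utf-8"):
    """Open `path` for reading text, choosing the encoding from its BOM."""
    with open(path, "rb") as raw:
        head = raw.read(4)
    return open(path, "r", encoding=guess_encoding_from_bom(head, default))
```

One known ambiguity: a UTF-16-LE file whose first character after the BOM
is U+0000 starts with the same four bytes as a UTF-32-LE BOM, so the sketch
would misidentify it. A real implementation would have to decide how to
handle that case.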

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktGzLsACgkQ+gerLs4ltQ5+cwCdGfycPdj6+cPfD23vH644SpHL
sI0AoLGD7nfgMEJdJhBr90yjQQHfDgcJ
=js+2
-----END PGP SIGNATURE-----