[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Mon Jan 11 18:27:01 CET 2010

On Mon, Jan 11, 2010 at 18:16, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> But an autodetect feature is not a codec. Sure it should be reusable,
>> but making it a codec seems to be a weird hack to me.
>
> Well, the existing UTF-16 codec also is an autodetect feature (to
> detect the endianness), and I don't consider it a weird hack.

So the BOM codec should raise a UnicodeDecodeError if there is no BOM?
That's what it would have to do in that case, because it can't fall
back on anything; it can only handle and implement the encodings that
have a BOM. And is it then actually very useful? You would have to do
a try/except, first with encoding='BOM' and then with encoding=None,
to get the fallback to the default encoding.
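
To make the double attempt concrete, here is a rough sketch. No 'BOM'
codec exists, so read_with_bom_codec() is a hypothetical stand-in for
open(path, encoding='BOM').read(), and read_text() is just an
illustrative name for the wrapper every caller would end up writing:

```python
import codecs

# Hypothetical stand-in for open(path, encoding='BOM').read().
# It can only decode when a BOM is present; otherwise it must fail.
def read_with_bom_codec(path):
    with open(path, "rb") as f:
        data = f.read()
    # Check UTF-32 first: its little-endian BOM starts with the UTF-16 one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return data.decode("utf-32")
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    raise UnicodeDecodeError("BOM", data, 0, min(len(data), 1), "no BOM found")

# The wrapper a caller needs in order to get a fallback.
def read_text(path):
    try:
        return read_with_bom_codec(path)
    except UnicodeDecodeError:
        # No BOM: second attempt with the default encoding.
        with open(path, encoding=None) as f:
            return f.read()
```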

I must say that I find this whole thing pretty obvious. 'BOM' is not
an encoding. Either there should be a function to get the encoding from
the BOM, returning None if there isn't one, or open() should look at
the BOM when you pass in encoding=None. Or both.

That covers all use cases, and is easy and obvious: either
open(file, encoding=None) or
open(file, encoding=encoding_from_bom(file)).
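
A minimal sketch of what such a helper could look like, assuming it
takes a file name and returns a codec name that open() accepts.
encoding_from_bom() does not exist in the standard library; the name and
the choice of returned codec names are only my assumptions:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
# The returned codecs all consume the BOM themselves when decoding.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def encoding_from_bom(path):
    """Return a codec name matching the file's BOM, or None without a BOM."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Since the helper returns None when there is no BOM, the result can be
passed straight to open(), which then uses its normal default:

    f = open(path, encoding=encoding_from_bom(path))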

I can't see that open(file, encoding='BOM') has any benefit over this,
covers any extra use case, or is clearer in any way. Instead it adds
something confusing: an encoding that isn't an encoding.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64