[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Sat Jan 9 12:18:33 CET 2010

Victor Stinner wrote:
> On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
>>> The builtin open() function is unable to open a UTF-16/32 file starting
>>> with a BOM if the encoding is not specified (it raises a Unicode error).
>>> For a UTF-8 file starting with a BOM, read()/readline() also returns the
>>> BOM, whereas the BOM should be "ignored".
>> It depends. If you use the utf-8-sig encoding, it *will* ignore the
>> UTF-8 signature.
> 
> Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and 
> UTF-8+BOM files, you have to detect the encoding (not an easy job) or 
> remove the BOM after the first read (much harder if you use a module like 
> ConfigParser or csv).
> 
>>> Since my proposal changes the result of TextIOWrapper.read()/readline()
>>> for files starting with a BOM, we might introduce an option to open() to
>>> enable the new behaviour. But is it really necessary to keep backward
>>> compatibility?
>> Absolutely. And there is no need to produce a new option, but instead
>> use the existing options: define an encoding that auto-detects the
>> encoding from the family of BOMs. Maybe you call it encoding="sniff".
> 
> Good idea, I chose open(filename, encoding="BOM").

On the surface this looks like there's an encoding named "BOM", but 
looking at your patch I found that the check is still done in 
TextIOWrapper. IMHO the best approach would be to implement a *real* 
codec named "BOM" (or "sniff"). This doesn't require *any* changes to 
the IO library. It could even be developed as a standalone project and 
published in the Cheeseshop.
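
As a rough illustration of the standalone route, here is a minimal sketch 
that registers a codec through codecs.register(). The name "sniff", the 
helper names, the UTF-8 fallback when no BOM is found, and the choice to 
encode as plain UTF-8 are all assumptions made for the example, not part 
of any patch. It only provides the stateless encode/decode pair, so it 
covers one-shot calls such as bytes.decode("sniff"); open() additionally 
needs an incremental decoder.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 ones because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect(data, default="utf-8"):
    """Return the codec name implied by a leading BOM, else `default`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default  # assumption: BOM-less input is treated as UTF-8


def sniff_decode(data, errors="strict"):
    """Stateless decode: choose the real codec from the BOM, then decode."""
    data = bytes(data)
    return data.decode(detect(data), errors), len(data)


def sniff_encode(text, errors="strict"):
    """Assumption for this sketch: always write UTF-8 without a BOM."""
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # Codec search function: answer only for our hypothetical codec name.
    if name == "sniff":
        return codecs.CodecInfo(
            name="sniff", encode=sniff_encode, decode=sniff_decode
        )
    return None


codecs.register(_search)
```

With this in place, (codecs.BOM_UTF8 + b"abc").decode("sniff") gives 
"abc" with the signature stripped, and BOM-less bytes fall back to UTF-8.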

To see how something like this can be done, take a look at the UTF-16 
codec, which switches to big-endian or little-endian mode depending on 
what it finds in the first read/decode call.
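
The same idea can be sketched for BOM sniffing: hold back the first few 
bytes, pick the real decoder once enough data has arrived, then delegate 
everything to it. The class name, the UTF-8 default and the 4-byte 
threshold below are assumptions for illustration. getstate()/setstate() 
are left out, so a real codec would need them for tell()/seek() to work.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


class SniffIncrementalDecoder(codecs.IncrementalDecoder):
    """Hypothetical decoder that chooses the real codec from the BOM."""

    def __init__(self, errors="strict", default="utf-8"):
        super().__init__(errors)
        self.default = default  # assumption: codec used when no BOM is found
        self.pending = b""      # bytes held back until we can decide
        self.decoder = None     # the real incremental decoder, once chosen

    def decode(self, data, final=False):
        if self.decoder is not None:
            return self.decoder.decode(data, final)
        self.pending += bytes(data)
        # The longest BOM is 4 bytes; wait for that many unless input ends.
        if len(self.pending) < 4 and not final:
            return ""
        name = self.default
        for bom, codec_name in _BOMS:
            if self.pending.startswith(bom):
                name = codec_name
                break
        # The chosen codecs consume the BOM themselves, so the buffered
        # bytes are passed through unchanged.
        self.decoder = codecs.getincrementaldecoder(name)(self.errors)
        data, self.pending = self.pending, b""
        return self.decoder.decode(data, final)

    def reset(self):
        super().reset()
        self.pending = b""
        self.decoder = None
```

One known ambiguity remains: UTF-16-LE text that begins with a BOM 
followed by U+0000 is indistinguishable from UTF-32-LE, and this sketch 
resolves it in favour of UTF-32.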

Servus,
    Walter