[Python-Dev] Improve open() to support reading file starting with an unicode BOM
Victor Stinner
victor.stinner at haypocalc.com
Fri Jan 8 11:40:28 CET 2010
On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
> > The builtin open() function is unable to open a UTF-16/32 file starting
> > with a BOM if the encoding is not specified (it raises a Unicode error).
> > For a UTF-8 file starting with a BOM, read()/readline() also return the
> > BOM, whereas the BOM should be "ignored".
>
> It depends. If you use the utf-8-sig encoding, it *will* ignore the
> UTF-8 signature.
Sure, but that means you only handle UTF-8+BOM files. If you get both UTF-8
and UTF-8+BOM files, you have to detect the encoding (not an easy job) or
remove the BOM after the first read (much harder if you use a module like
ConfigParser or csv).
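
To make the behaviour under discussion concrete, here is a minimal sketch
(Python 3). The helper name read_text and the sample content are only
illustrative assumptions; the codecs used are the standard "utf-8" and
"utf-8-sig" ones.

```python
import codecs
import os
import tempfile


def read_text(data: bytes, encoding: str) -> str:
    """Write data to a temporary file, then read it back through open()."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


# A UTF-8 file that starts with a BOM (hypothetical sample content).
with_bom = codecs.BOM_UTF8 + "key = value\n".encode("utf-8")

plain = read_text(with_bom, "utf-8")         # the BOM comes back as U+FEFF
stripped = read_text(with_bom, "utf-8-sig")  # the BOM is dropped
```

With encoding="utf-8" the caller has to strip the leading U+FEFF itself,
which is awkward when the file object is consumed by another module.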
> > Since my proposal changes the result of TextIOWrapper.read()/readline()
> > for files starting with a BOM, we might introduce an option to open() to
> > enable the new behaviour. But is it really necessary to keep backward
> > compatibility?
>
> Absolutely. And there is no need to produce a new option, but instead
> use the existing options: define an encoding that auto-detects the
> encoding from the family of BOMs. Maybe you call it encoding="sniff".
Good idea, I chose open(filename, encoding="BOM").
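
For illustration only, BOM sniffing of this kind could be approximated in
pure Python roughly as follows. This is a sketch, not the patch itself: the
names sniff_bom and open_sniffing_bom and the "utf-8" fallback are
hypothetical, and encoding="BOM" is not an existing codec name. Only the
codecs.BOM_* constants and the standard codecs are real.

```python
import codecs

# Longer BOMs are checked first: the UTF-32-LE BOM (FF FE 00 00) begins
# with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on the BOM at the start of head.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    themselves while decoding. The default is an assumption of this sketch.
    """
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default


def open_sniffing_bom(filename, default="utf-8", **kwargs):
    """Hypothetical helper: open filename as text, choosing the encoding
    from its BOM, or falling back to default when there is none."""
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    return open(filename, encoding=sniff_bom(head, default), **kwargs)
```

One known ambiguity of this approach: a UTF-16-LE file whose first character
after the BOM is U+0000 starts with the same four bytes as a UTF-32-LE BOM,
so it would be misdetected.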
--
Victor Stinner
http://www.haypocalc.com/